Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

Bibliographic Details
Main Authors: Li, Weizhong, Wooley, John C., Godzik, Adam
Format: Text
Language: English
Published: Public Library of Science 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557142/
https://www.ncbi.nlm.nih.gov/pubmed/18846219
http://dx.doi.org/10.1371/journal.pone.0003375
_version_ 1782159632310992896
author Li, Weizhong
Wooley, John C.
Godzik, Adam
author_facet Li, Weizhong
Wooley, John C.
Godzik, Adam
author_sort Li, Weizhong
collection PubMed
description BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations. CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
format Text
id pubmed-2557142
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2557142 2008-10-10 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets Li, Weizhong Wooley, John C. Godzik, Adam PLoS One Research Article
Public Library of Science 2008-10-10 /pmc/articles/PMC2557142/ /pubmed/18846219 http://dx.doi.org/10.1371/journal.pone.0003375 Text en Li et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Li, Weizhong
Wooley, John C.
Godzik, Adam
Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title_full Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title_fullStr Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title_full_unstemmed Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title_short Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
title_sort probing metagenomics by rapid cluster analysis of very large datasets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557142/
https://www.ncbi.nlm.nih.gov/pubmed/18846219
http://dx.doi.org/10.1371/journal.pone.0003375
work_keys_str_mv AT liweizhong probingmetagenomicsbyrapidclusteranalysisofverylargedatasets
AT wooleyjohnc probingmetagenomicsbyrapidclusteranalysisofverylargedatasets
AT godzikadam probingmetagenomicsbyrapidclusteranalysisofverylargedatasets