Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the...
Main authors: | Li, Weizhong; Wooley, John C.; Godzik, Adam |
---|---|
Format: | Text |
Language: | English |
Published: | Public Library of Science, 2008 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557142/ https://www.ncbi.nlm.nih.gov/pubmed/18846219 http://dx.doi.org/10.1371/journal.pone.0003375 |
_version_ | 1782159632310992896 |
---|---|
author | Li, Weizhong; Wooley, John C.; Godzik, Adam |
author_facet | Li, Weizhong; Wooley, John C.; Godzik, Adam |
author_sort | Li, Weizhong |
collection | PubMed |
description | BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated by organism composition, functional class, and sample location. CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project. |
format | Text |
id | pubmed-2557142 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-2557142 2008-10-10 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets Li, Weizhong; Wooley, John C.; Godzik, Adam PLoS One Research Article Public Library of Science 2008-10-10 /pmc/articles/PMC2557142/ /pubmed/18846219 http://dx.doi.org/10.1371/journal.pone.0003375 Text en Li et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Li, Weizhong; Wooley, John C.; Godzik, Adam Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title | Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title_full | Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title_fullStr | Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title_full_unstemmed | Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title_short | Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets |
title_sort | probing metagenomics by rapid cluster analysis of very large datasets |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557142/ https://www.ncbi.nlm.nih.gov/pubmed/18846219 http://dx.doi.org/10.1371/journal.pone.0003375 |
work_keys_str_mv | AT liweizhong probingmetagenomicsbyrapidclusteranalysisofverylargedatasets AT wooleyjohnc probingmetagenomicsbyrapidclusteranalysisofverylargedatasets AT godzikadam probingmetagenomicsbyrapidclusteranalysisofverylargedatasets |