SPA: a short peptide assembler for metagenomic data
The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig s...
Main Authors: | Yang, Youngik; Yooseph, Shibu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2013 |
Subjects: | Methods Online |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632116/ https://www.ncbi.nlm.nih.gov/pubmed/23435317 http://dx.doi.org/10.1093/nar/gkt118 |
_version_ | 1782266841051168768 |
---|---|
author | Yang, Youngik Yooseph, Shibu |
author_facet | Yang, Youngik Yooseph, Shibu |
author_sort | Yang, Youngik |
collection | PubMed |
description | The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies, results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternate approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed. |
format | Online Article Text |
id | pubmed-3632116 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-36321162013-04-22 SPA: a short peptide assembler for metagenomic data Yang, Youngik Yooseph, Shibu Nucleic Acids Res Methods Online The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies, results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternate approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed. Oxford University Press 2013-04 2013-02-22 /pmc/articles/PMC3632116/ /pubmed/23435317 http://dx.doi.org/10.1093/nar/gkt118 Text en © The Author(s) 2013. Published by Oxford University Press. 
http://creativecommons.org/licenses/by-nc/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methods Online Yang, Youngik Yooseph, Shibu SPA: a short peptide assembler for metagenomic data |
title | SPA: a short peptide assembler for metagenomic data |
title_full | SPA: a short peptide assembler for metagenomic data |
title_fullStr | SPA: a short peptide assembler for metagenomic data |
title_full_unstemmed | SPA: a short peptide assembler for metagenomic data |
title_short | SPA: a short peptide assembler for metagenomic data |
title_sort | spa: a short peptide assembler for metagenomic data |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632116/ https://www.ncbi.nlm.nih.gov/pubmed/23435317 http://dx.doi.org/10.1093/nar/gkt118 |
work_keys_str_mv | AT yangyoungik spaashortpeptideassemblerformetagenomicdata AT yoosephshibu spaashortpeptideassemblerformetagenomicdata |
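
The description field in this record mentions traversals of a de Bruijn graph defined on an amino acid alphabet. The idea can be illustrated with a minimal sketch, which is not the SPA implementation: the record does not describe how SPA scores or selects paths, so the greedy "most frequent unvisited successor" rule, the function names `build_graph` and `greedy_path`, the value `k = 4`, and the toy peptide fragments are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Build a de Bruijn graph over amino acid k-mers.

    Nodes are (k-1)-mers; each k-mer adds one edge from its prefix to its
    suffix. Edge weights count how often the k-mer was observed.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def greedy_path(graph, start):
    """Spell a sequence by repeatedly taking the heaviest unvisited successor.

    This greedy rule is a stand-in (assumption) for the informed traversal
    mentioned in the abstract; it stops when no unvisited successor remains.
    """
    seq = start
    node = start
    visited = {start}
    while True:
        successors = {n: c for n, c in graph.get(node, {}).items()
                      if n not in visited}
        if not successors:
            break
        # Highest count first; alphabetical order breaks ties deterministically.
        node = min(successors, key=lambda n: (-successors[n], n))
        visited.add(node)
        seq += node[-1]
    return seq


# Hypothetical overlapping peptide fragments, as might be predicted on short reads.
fragments = ["MKTAYIAK", "AYIAKQRQ", "KQRQISFV"]
graph = build_graph(fragments, k=4)
assembled = greedy_path(graph, "MKT")
print(assembled)  # MKTAYIAKQRQISFV
```

In this toy case the three overlapping fragments share k-mers, so a single path through the graph spells a sequence longer than any individual fragment. Real metagenomic data would contain branches from sequencing errors and related proteins, which is where a more informed path selection than this greedy rule would be needed.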