
SPA: a short peptide assembler for metagenomic data

The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig s...


Bibliographic Details
Main Authors: Yang, Youngik; Yooseph, Shibu
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632116/
https://www.ncbi.nlm.nih.gov/pubmed/23435317
http://dx.doi.org/10.1093/nar/gkt118
author Yang, Youngik
Yooseph, Shibu
collection PubMed
description The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternative approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed.
format Online
Article
Text
id pubmed-3632116
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3632116 2013-04-22 SPA: a short peptide assembler for metagenomic data Yang, Youngik Yooseph, Shibu Nucleic Acids Res Methods Online Oxford University Press 2013-04 2013-02-22 /pmc/articles/PMC3632116/ /pubmed/23435317 http://dx.doi.org/10.1093/nar/gkt118 Text en © The Author(s) 2013. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title SPA: a short peptide assembler for metagenomic data
topic Methods Online