
MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs

BACKGROUND: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. Xander, as a pilot study, indeed has a lot of room for im...


Bibliographic Details
Main authors: Li, Dinghua, Huang, Yukun, Leung, Chi-Ming, Luo, Ruibang, Ting, Hing-Fung, Lam, Tak-Wah
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657035/
https://www.ncbi.nlm.nih.gov/pubmed/29072142
http://dx.doi.org/10.1186/s12859-017-1825-3
author Li, Dinghua
Huang, Yukun
Leung, Chi-Ming
Luo, Ruibang
Ting, Hing-Fung
Lam, Tak-Wah
collection PubMed
description BACKGROUND: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. Xander, as a pilot study, leaves considerable room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and any choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint. Bloom filters introduce false positives, and it is not clear how these affect the quality of the assembly. Xander does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate erroneous k-mers from correct ones.
RESULTS: In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in several respects. Quality-wise, it uses iterative de Bruijn graphs to take full advantage of multiple k-mer sizes and get the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBGs). Unlike a Bloom filter, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM model. We compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7–19.3% more contigs than Xander, and these contigs were assigned to 10–25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
CONCLUSION: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta .
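The abstract notes that tracking k-mer multiplicity helps separate erroneous k-mers from correct ones. The following is a minimal Python sketch of that idea on toy data, not MegaGTA's implementation: the function names (`count_kmers`, `solid_kmers`, `de_bruijn_edges`), the reads, and the multiplicity threshold are all illustrative assumptions. It uses a plain dictionary rather than a succinct de Bruijn graph and does no HMM-guided traversal.

```python
from collections import Counter


def count_kmers(reads, k):
    """Map each k-mer to its multiplicity (number of occurrences across all reads)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_multiplicity):
    """Keep only k-mers seen at least `min_multiplicity` times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_multiplicity}


def de_bruijn_edges(kmers):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)


# Toy reads: three error-free copies and one read with a single substitution.
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]

counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_multiplicity=2)
edges = de_bruijn_edges(solid)

for prefix, suffix in edges:
    print(prefix, "->", suffix)
```

In this toy input, the k-mers introduced by the substitution each occur only once, so the threshold removes them, while the k-mers supported by several reads remain in the graph. An exact k-mer table makes such counts directly available, whereas a Bloom filter only answers approximate membership queries.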
format Online
Article
Text
id pubmed-5657035
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5657035 2017-10-31 MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs Li, Dinghua Huang, Yukun Leung, Chi-Ming Luo, Ruibang Ting, Hing-Fung Lam, Tak-Wah BMC Bioinformatics Software BioMed Central 2017-10-16 /pmc/articles/PMC5657035/ /pubmed/29072142 http://dx.doi.org/10.1186/s12859-017-1825-3 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs
topic Software