Cargando…

Read mapping on de Bruijn graphs

BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph...

Descripción completa

Detalles Bibliográficos
Autores principales: Limasset, Antoine, Cazaux, Bastien, Rivals, Eric, Peterlongo, Pierre
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4910249/
https://www.ncbi.nlm.nih.gov/pubmed/27306641
http://dx.doi.org/10.1186/s12859-016-1103-9
_version_ 1782437980574580736
author Limasset, Antoine
Cazaux, Bastien
Rivals, Eric
Peterlongo, Pierre
author_facet Limasset, Antoine
Cazaux, Bastien
Rivals, Eric
Peterlongo, Pierre
author_sort Limasset, Antoine
collection PubMed
description BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1103-9) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4910249
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-49102492016-06-17 Read mapping on de Bruijn graphs Limasset, Antoine Cazaux, Bastien Rivals, Eric Peterlongo, Pierre BMC Bioinformatics Research Article BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1103-9) contains supplementary material, which is available to authorized users. BioMed Central 2016-06-16 /pmc/articles/PMC4910249/ /pubmed/27306641 http://dx.doi.org/10.1186/s12859-016-1103-9 Text en © Limasset et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Limasset, Antoine
Cazaux, Bastien
Rivals, Eric
Peterlongo, Pierre
Read mapping on de Bruijn graphs
title Read mapping on de Bruijn graphs
title_full Read mapping on de Bruijn graphs
title_fullStr Read mapping on de Bruijn graphs
title_full_unstemmed Read mapping on de Bruijn graphs
title_short Read mapping on de Bruijn graphs
title_sort read mapping on de bruijn graphs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4910249/
https://www.ncbi.nlm.nih.gov/pubmed/27306641
http://dx.doi.org/10.1186/s12859-016-1103-9
work_keys_str_mv AT limassetantoine readmappingondebruijngraphs
AT cazauxbastien readmappingondebruijngraphs
AT rivalseric readmappingondebruijngraphs
AT peterlongopierre readmappingondebruijngraphs