Cargando…
Read mapping on de Bruijn graphs
BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph...
Autores principales: | , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2016
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4910249/ https://www.ncbi.nlm.nih.gov/pubmed/27306641 http://dx.doi.org/10.1186/s12859-016-1103-9 |
_version_ | 1782437980574580736 |
---|---|
author | Limasset, Antoine Cazaux, Bastien Rivals, Eric Peterlongo, Pierre |
author_facet | Limasset, Antoine Cazaux, Bastien Rivals, Eric Peterlongo, Pierre |
author_sort | Limasset, Antoine |
collection | PubMed |
description | BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1103-9) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4910249 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-49102492016-06-17 Read mapping on de Bruijn graphs Limasset, Antoine Cazaux, Bastien Rivals, Eric Peterlongo, Pierre BMC Bioinformatics Research Article BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22 % more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1103-9) contains supplementary material, which is available to authorized users. BioMed Central 2016-06-16 /pmc/articles/PMC4910249/ /pubmed/27306641 http://dx.doi.org/10.1186/s12859-016-1103-9 Text en © Limasset et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Limasset, Antoine Cazaux, Bastien Rivals, Eric Peterlongo, Pierre Read mapping on de Bruijn graphs |
title | Read mapping on de Bruijn graphs |
title_full | Read mapping on de Bruijn graphs |
title_fullStr | Read mapping on de Bruijn graphs |
title_full_unstemmed | Read mapping on de Bruijn graphs |
title_short | Read mapping on de Bruijn graphs |
title_sort | read mapping on de bruijn graphs |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4910249/ https://www.ncbi.nlm.nih.gov/pubmed/27306641 http://dx.doi.org/10.1186/s12859-016-1103-9 |
work_keys_str_mv | AT limassetantoine readmappingondebruijngraphs AT cazauxbastien readmappingondebruijngraphs AT rivalseric readmappingondebruijngraphs AT peterlongopierre readmappingondebruijngraphs |