Efficient short read mapping to a pangenome that is represented by a graph of ED strings
MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a...
Main authors: | Büchler, Thomas; Olbrich, Jannik; Ohlebusch, Enno |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10232250/ https://www.ncbi.nlm.nih.gov/pubmed/37171844 http://dx.doi.org/10.1093/bioinformatics/btad320 |
_version_ | 1785051931510571008 |
---|---|
author | Büchler, Thomas Olbrich, Jannik Ohlebusch, Enno |
author_facet | Büchler, Thomas Olbrich, Jannik Ohlebusch, Enno |
author_sort | Büchler, Thomas |
collection | PubMed |
description | MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, e.g. reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time. RESULTS: We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient. AVAILABILITY AND IMPLEMENTATION: Sources are available at: https://github.com/thomas-buechler-ulm/gedmap. |
format | Online Article Text |
id | pubmed-10232250 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10232250 2023-06-01 Efficient short read mapping to a pangenome that is represented by a graph of ED strings Büchler, Thomas Olbrich, Jannik Ohlebusch, Enno Bioinformatics Original Paper MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, e.g. reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time. RESULTS: We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient. AVAILABILITY AND IMPLEMENTATION: Sources are available at: https://github.com/thomas-buechler-ulm/gedmap. Oxford University Press 2023-05-12 /pmc/articles/PMC10232250/ /pubmed/37171844 http://dx.doi.org/10.1093/bioinformatics/btad320 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Büchler, Thomas Olbrich, Jannik Ohlebusch, Enno Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title | Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title_full | Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title_fullStr | Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title_full_unstemmed | Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title_short | Efficient short read mapping to a pangenome that is represented by a graph of ED strings |
title_sort | efficient short read mapping to a pangenome that is represented by a graph of ed strings |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10232250/ https://www.ncbi.nlm.nih.gov/pubmed/37171844 http://dx.doi.org/10.1093/bioinformatics/btad320 |
work_keys_str_mv | AT buchlerthomas efficientshortreadmappingtoapangenomethatisrepresentedbyagraphofedstrings AT olbrichjannik efficientshortreadmappingtoapangenomethatisrepresentedbyagraphofedstrings AT ohlebuschenno efficientshortreadmappingtoapangenomethatisrepresentedbyagraphofedstrings |
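
Illustrative note on the description field: the abstract mentions two ideas, encoding small variations directly in an elastic-degenerate (ED) string and indexing with minimizers. The Python sketch below shows one possible reading of each idea on a toy example. It is not the GED-MAP implementation: the list-of-alternatives data layout, the brute-force expansion, and the names `spelled_sequences`, `read_occurs`, and `minimizers` are assumptions made only for illustration, and the expansion is exponential in the number of variant sites, so it is usable only on tiny inputs.

```python
from itertools import product

# Assumed toy model: an ED string is a list of segments, and each segment is
# a list of alternative strings. An empty string models an absent insertion.

def spelled_sequences(eds):
    """Yield every linear sequence obtained by picking one alternative per segment."""
    for choice in product(*eds):
        yield "".join(choice)

def read_occurs(eds, read):
    """Brute-force check: does the read occur in any sequence spelled by the ED string?"""
    return any(read in seq for seq in spelled_sequences(eds))

def minimizers(seq, k, w):
    """Return the set of (position, k-mer) pairs that are the lexicographically
    smallest k-mer in some window of w consecutive k-mers (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + best, window[best]))
    return picked

# Toy pangenome fragment: a SNP (A or T) and an optional insertion (CC).
eds = [["ACG"], ["A", "T"], ["GT"], ["", "CC"], ["A"]]

print(sorted(spelled_sequences(eds)))
print(read_occurs(eds, "GTGTC"))           # read covering the T allele and the insertion
print(read_occurs(eds, "GAGTG"))           # read that no haplotype spells
print(sorted(minimizers("ACGTGTCCA", k=3, w=3)))
```

A practical mapper would avoid enumerating sequences and would instead index the linearized representation, as the abstract describes; the sketch only makes the two concepts concrete.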