Efficient short read mapping to a pangenome that is represented by a graph of ED strings

MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a...

Bibliographic Details
Main Authors: Büchler, Thomas; Olbrich, Jannik; Ohlebusch, Enno
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10232250/
https://www.ncbi.nlm.nih.gov/pubmed/37171844
http://dx.doi.org/10.1093/bioinformatics/btad320
_version_ 1785051931510571008
author Büchler, Thomas
Olbrich, Jannik
Ohlebusch, Enno
collection PubMed
description MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research has focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, for example, reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time. RESULTS: We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence, whereas structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototype software tool, GED-MAP, that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works at whole-human-genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory-efficient.
AVAILABILITY AND IMPLEMENTATION: Sources are available at: https://github.com/thomas-buechler-ulm/gedmap.
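The two ideas named in the description, small variants encoded directly in an ED string and a minimizer index, can be illustrated with a toy sketch. It is not taken from GED-MAP. The list-of-alternatives encoding, the function names `expand` and `minimizers`, the parameters `k` and `w`, and the lexicographic ordering of k-mers are all assumptions made for illustration only.

```python
from itertools import product


def expand(eds):
    """Enumerate every plain string spelled by a toy ED string.

    Assumed encoding (hypothetical): a list of segments, each segment a
    list of alternative strings; "" stands for an absent allele.
    The output grows exponentially, so this is for toy inputs only.
    """
    return ["".join(choice) for choice in product(*eds)]


def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of an ordinary string.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer; ties are broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return picked


# A SNP (T or C) and a short insertion ("" or "TT") written into the sequence.
eds = [["ACG"], ["T", "C"], ["GA"], ["", "TT"], ["C"]]
haplotypes = expand(eds)

# Minimizers of one spelled string; an index would map each k-mer to positions.
mins = minimizers(haplotypes[0], k=3, w=2)
```

Enumerating all spelled strings is exactly what a practical index would avoid; the sketch only shows what the compact encoding stands for and what a minimizer is.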
format Online
Article
Text
id pubmed-10232250
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10232250 2023-06-01 Bioinformatics Original Paper Oxford University Press 2023-05-12 /pmc/articles/PMC10232250/ /pubmed/37171844 http://dx.doi.org/10.1093/bioinformatics/btad320 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Efficient short read mapping to a pangenome that is represented by a graph of ED strings
topic Original Paper