Chaining for accurate alignment of erroneous long reads to acyclic variation graphs

MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2...

Bibliographic Details
Main authors: Ma, Jun; Cáceres, Manuel; Salmela, Leena; Mäkinen, Veli; Tomescu, Alexandru I
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Original Paper
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423031/
https://www.ncbi.nlm.nih.gov/pubmed/37494467
http://dx.doi.org/10.1093/bioinformatics/btad460
author Ma, Jun
Cáceres, Manuel
Salmela, Leena
Mäkinen, Veli
Tomescu, Alexandru I
collection PubMed
description MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253–28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds.
RESULTS: We present a new algorithm to co-linearly chain a set of seeds in a string-labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates of 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12–17% more reads, and 21–28% more total read length, on real PacBio CLR reads from human chromosomes 1 and 22 and from the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265–19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58–73.)] obtain an accuracy of <60% in this setting.
AVAILABILITY AND IMPLEMENTATION: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
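The abstract contrasts extending individual seeds with co-linearly chaining several seeds in an acyclic graph. As a rough illustration of that idea only, here is a naive quadratic sketch in Python. It is not GraphChainer's algorithm, whose details this record does not give. The anchor format `(node, read_start, read_end)`, the coverage-only score, and the function names `reachable_sets` and `chain_anchors` are all assumptions made for the example.

```python
def reachable_sets(graph):
    """Map every node of a DAG (dict: node -> list of successors)
    to the set of nodes strictly reachable from it."""
    memo = {}

    def visit(u):
        if u not in memo:
            reach = set()
            for v in graph.get(u, []):
                reach.add(v)
                reach |= visit(v)
            memo[u] = reach
        return memo[u]

    for u in graph:
        visit(u)
    return memo


def chain_anchors(graph, anchors):
    """Pick a chain of anchors maximizing covered read length.

    Hypothetical anchor format: (node, read_start, read_end), with a
    half-open, non-empty read interval. Anchor i may precede anchor j
    when i ends in the read before j starts and j's node is reachable
    from i's node in the graph.
    """
    if not anchors:
        return 0, []
    reach = reachable_sets(graph)
    # A valid predecessor always starts earlier in the read, so sorting
    # by read_start gives a correct evaluation order for the DP.
    order = sorted(range(len(anchors)), key=lambda i: anchors[i][1])
    best = [0] * len(anchors)      # best coverage of a chain ending at i
    prev = [None] * len(anchors)   # predecessor of i on that chain
    for idx, j in enumerate(order):
        node_j, start_j, end_j = anchors[j]
        best[j] = end_j - start_j
        for i in order[:idx]:
            node_i, _, end_i = anchors[i]
            if end_i <= start_j and node_j in reach.get(node_i, set()):
                candidate = best[i] + (end_j - start_j)
                if candidate > best[j]:
                    best[j], prev[j] = candidate, i
    last = max(range(len(anchors)), key=lambda i: best[i])
    chain, k = [], last
    while k is not None:
        chain.append(anchors[k])
        k = prev[k]
    return best[last], chain[::-1]


# Toy example: a small bubble a -> {b, c} -> d.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
toy_anchors = [("a", 0, 5), ("b", 5, 9), ("c", 4, 12), ("d", 12, 20), ("b", 20, 30)]
coverage, chain = chain_anchors(toy_graph, toy_anchors)
```

In the toy example, the anchor on `c` overlaps the anchor on `a` in the read, and the second anchor on `b` cannot follow `d` because `b` is not reachable from `d`, so the sketch chains the anchors on `a`, `b`, and `d`. A practical aligner would need a much faster method than this all-pairs comparison.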
format Online
Article
Text
id pubmed-10423031
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10423031 2023-08-13 Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Ma, Jun; Cáceres, Manuel; Salmela, Leena; Mäkinen, Veli; Tomescu, Alexandru I. Bioinformatics, Original Paper. Oxford University Press 2023-07-26. /pmc/articles/PMC10423031/ /pubmed/37494467 http://dx.doi.org/10.1093/bioinformatics/btad460 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Chaining for accurate alignment of erroneous long reads to acyclic variation graphs
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423031/
https://www.ncbi.nlm.nih.gov/pubmed/37494467
http://dx.doi.org/10.1093/bioinformatics/btad460