Chaining for accurate alignment of erroneous long reads to acyclic variation graphs
MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2...
Main Authors: | Ma, Jun; Cáceres, Manuel; Salmela, Leena; Mäkinen, Veli; Tomescu, Alexandru I |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423031/ https://www.ncbi.nlm.nih.gov/pubmed/37494467 http://dx.doi.org/10.1093/bioinformatics/btad460 |
_version_ | 1785089358778335232 |
---|---|
author | Ma, Jun Cáceres, Manuel Salmela, Leena Mäkinen, Veli Tomescu, Alexandru I |
collection | PubMed |
description | MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253–28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS: We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12–17% more reads, and 21–28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265–19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58–73.)] obtain an accuracy of <60% on this setting. 
AVAILABILITY AND IMPLEMENTATION: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address. |
format | Online Article Text |
id | pubmed-10423031 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10423031 2023-08-13 Bioinformatics Original Paper Oxford University Press 2023-07-26 /pmc/articles/PMC10423031/ /pubmed/37494467 http://dx.doi.org/10.1093/bioinformatics/btad460 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Chaining for accurate alignment of erroneous long reads to acyclic variation graphs |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423031/ https://www.ncbi.nlm.nih.gov/pubmed/37494467 http://dx.doi.org/10.1093/bioinformatics/btad460 |