Cargando…

Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequ...

Descripción completa

Detalles Bibliográficos
Autores principales: Yan, Yiqing, Chaturvedi, Nimisha, Appuswamy, Raja
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/
https://www.ncbi.nlm.nih.gov/pubmed/34016035
http://dx.doi.org/10.1186/s12859-021-04162-z
_version_ 1783695919108063232
author Yan, Yiqing
Chaturvedi, Nimisha
Appuswamy, Raja
author_facet Yan, Yiqing
Chaturvedi, Nimisha
Appuswamy, Raja
author_sort Yan, Yiqing
collection PubMed
description BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z.
format Online
Article
Text
id pubmed-8139006
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-81390062021-05-21 Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja BMC Bioinformatics Methodology Article BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z. BioMed Central 2021-05-20 /pmc/articles/PMC8139006/ /pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Yan, Yiqing
Chaturvedi, Nimisha
Appuswamy, Raja
Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_full Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_fullStr Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_full_unstemmed Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_short Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_sort accel-align: a fast sequence mapper and aligner based on the seed–embed–extend method
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/
https://www.ncbi.nlm.nih.gov/pubmed/34016035
http://dx.doi.org/10.1186/s12859-021-04162-z
work_keys_str_mv AT yanyiqing accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod
AT chaturvedinimisha accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod
AT appuswamyraja accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod