Cargando…
Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequ...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2021
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/ https://www.ncbi.nlm.nih.gov/pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z |
_version_ | 1783695919108063232 |
---|---|
author | Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja |
author_facet | Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja |
author_sort | Yan, Yiqing |
collection | PubMed |
description | BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z. |
format | Online Article Text |
id | pubmed-8139006 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-81390062021-05-21 Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja BMC Bioinformatics Methodology Article BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z. BioMed Central 2021-05-20 /pmc/articles/PMC8139006/ /pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title | Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title_full | Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title_fullStr | Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title_full_unstemmed | Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title_short | Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method |
title_sort | accel-align: a fast sequence mapper and aligner based on the seed–embed–extend method |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/ https://www.ncbi.nlm.nih.gov/pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z |
work_keys_str_mv | AT yanyiqing accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod AT chaturvedinimisha accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod AT appuswamyraja accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod |