Cargando…

Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequ...

Descripción completa

Detalles Bibliográficos
Autores principales:	Yan, Yiqing, Chaturvedi, Nimisha, Appuswamy, Raja
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2021
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/ https://www.ncbi.nlm.nih.gov/pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z

_version_	1783695919108063232
author	Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja
author_facet	Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja
author_sort	Yan, Yiqing
collection	PubMed
description	BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z.
format	Online Article Text
id	pubmed-8139006
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-81390062021-05-21 Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja BMC Bioinformatics Methodology Article BACKGROUND: Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. RESULTS: Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12[Formula: see text] faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. CONCLUSIONS: As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04162-z. BioMed Central 2021-05-20 /pmc/articles/PMC8139006/ /pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Methodology Article Yan, Yiqing Chaturvedi, Nimisha Appuswamy, Raja Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title	Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_full	Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_fullStr	Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_full_unstemmed	Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_short	Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
title_sort	accel-align: a fast sequence mapper and aligner based on the seed–embed–extend method
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139006/ https://www.ncbi.nlm.nih.gov/pubmed/34016035 http://dx.doi.org/10.1186/s12859-021-04162-z
work_keys_str_mv	AT yanyiqing accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod AT chaturvedinimisha accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod AT appuswamyraja accelalignafastsequencemapperandalignerbasedontheseedembedextendmethod

Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

Ejemplares similares