Efficient mapping of accurate long reads in minimizer space with mapquik

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in t...

Bibliographic Details
Main Authors:	Ekim, Bariş, Sahlin, Kristoffer, Medvedev, Paul, Berger, Bonnie, Chikhi, Rayan
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2023
Subjects:	Methods
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538364/ https://www.ncbi.nlm.nih.gov/pubmed/37399256 http://dx.doi.org/10.1101/gr.277679.123

author	Ekim, Bariş Sahlin, Kristoffer Medvedev, Paul Berger, Bonnie Chikhi, Rayan
collection	PubMed
description	DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
format	Online Article Text
id	pubmed-10538364
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-10538364 2023-09-29 Efficient mapping of accurate long reads in minimizer space with mapquik Ekim, Bariş Sahlin, Kristoffer Medvedev, Paul Berger, Bonnie Chikhi, Rayan Genome Res Methods Cold Spring Harbor Laboratory Press 2023-07 /pmc/articles/PMC10538364/ /pubmed/37399256 http://dx.doi.org/10.1101/gr.277679.123 Text en © 2023 Ekim et al.; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at https://creativecommons.org/licenses/by-nc/4.0/.
title	Efficient mapping of accurate long reads in minimizer space with mapquik
topic	Methods
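
The seeding idea summarized in the description field (anchor on k consecutively sampled minimizers, and index only those k-min-mers that occur once in the reference) can be illustrated with a minimal sketch. This is not mapquik's implementation: the function names, the BLAKE2b hash, the hash-threshold ("density") sampling rule, and the parameter values are all assumptions made for illustration, and the sketch ignores reverse complements, chaining, and sequencing errors.

```python
import hashlib
from collections import Counter


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sample_minimizers(seq: str, l: int, density: float):
    """Return (position, l-mer) pairs whose hash is below density * 2^64.

    Selection depends only on the l-mer itself, so the same l-mer is
    sampled in both the reference and any read that contains it.
    """
    threshold = int(density * 2**64)
    return [
        (i, seq[i:i + l])
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < threshold
    ]


def k_min_mers(seq: str, k: int, l: int, density: float):
    """Yield (start position, tuple of k consecutive sampled minimizers)."""
    mins = sample_minimizers(seq, l, density)
    for j in range(len(mins) - k + 1):
        window = mins[j:j + k]
        yield window[0][0], tuple(m for _, m in window)


def build_unique_index(reference: str, k: int, l: int, density: float):
    """Map each k-min-mer that occurs exactly once in the reference to its position."""
    kmms = list(k_min_mers(reference, k, l, density))
    counts = Counter(key for _, key in kmms)
    return {key: pos for pos, key in kmms if counts[key] == 1}


def seed_hits(read: str, index: dict, k: int, l: int, density: float):
    """Return (read position, reference position) anchors for one read."""
    return [
        (pos, index[key])
        for pos, key in k_min_mers(read, k, l, density)
        if key in index
    ]
```

Because repeated k-min-mers are left out of the index, each anchor points to a single reference position. For an error-free read taken from the forward strand, every anchor therefore shares the same offset between read and reference coordinates.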
