Parameterized syncmer schemes improve long-read mapping

MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacter...

Bibliographic Details
Main Authors: Dutta, Abhinav; Pellow, David; Shamir, Ron
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645665/
https://www.ncbi.nlm.nih.gov/pubmed/36306319
http://dx.doi.org/10.1371/journal.pcbi.1010638
_version_ 1784827011811770368
author Dutta, Abhinav
Pellow, David
Shamir, Ron
collection PubMed
description MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. RESULTS: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long-read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings.
format Online
Article
Text
id pubmed-9645665
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9645665 2022-11-15 Parameterized syncmer schemes improve long-read mapping Dutta, Abhinav Pellow, David Shamir, Ron PLoS Comput Biol Research Article
Public Library of Science 2022-10-28 /pmc/articles/PMC9645665/ /pubmed/36306319 http://dx.doi.org/10.1371/journal.pcbi.1010638 Text en © 2022 Dutta et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Parameterized syncmer schemes improve long-read mapping
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645665/
https://www.ncbi.nlm.nih.gov/pubmed/36306319
http://dx.doi.org/10.1371/journal.pcbi.1010638
work_keys_str_mv AT duttaabhinav parameterizedsyncmerschemesimprovelongreadmapping
AT pellowdavid parameterizedsyncmerschemesimprovelongreadmapping
AT shamirron parameterizedsyncmerschemesimprovelongreadmapping
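
Illustrative note on the sketching methods named in the description: the record itself does not define minimizers, syncmers, or PSS, so the following is only a minimal sketch under stated assumptions. It assumes that a k-mer is selected by a syncmer-style scheme when its smallest s-mer starts at one of a set of allowed offsets (one offset for an open syncmer, the first and last offsets for a closed syncmer, an arbitrary set for a parameterized scheme), and that a window minimizer is the smallest k-mer among w consecutive k-mers. Plain lexicographic order and leftmost tie-breaking stand in for the hash-based orderings that real mappers use, and the function names are hypothetical, not taken from minimap2 or Winnowmap2.

```python
from typing import Iterable, List


def smallest_smer_offset(kmer: str, s: int) -> int:
    """Return the 0-based offset of the leftmost smallest s-mer inside kmer.

    Assumption: lexicographic order; real tools typically order by a hash.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers))


def pss_select(seq: str, k: int, s: int, offsets: Iterable[int]) -> List[int]:
    """Return start positions of k-mers whose smallest s-mer sits at an allowed offset.

    The decision for each k-mer depends only on that k-mer, not on its neighbours.
    """
    allowed = set(offsets)
    return [
        i
        for i in range(len(seq) - k + 1)
        if smallest_smer_offset(seq[i:i + k], s) in allowed
    ]


def minimizer_select(seq: str, k: int, w: int) -> List[int]:
    """Return start positions of window minimizers.

    Each window holds w consecutive k-mers; the leftmost smallest one is picked,
    so the choice depends on the surrounding window.
    """
    picked = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        kmers = [seq[j:j + k] for j in range(start, start + w)]
        picked.add(start + kmers.index(min(kmers)))
    return sorted(picked)


if __name__ == "__main__":
    seq = "TGACGTTGCA"
    print("open-style, offset 0:   ", pss_select(seq, k=5, s=2, offsets=[0]))
    print("closed-style, offsets 0,3:", pss_select(seq, k=5, s=2, offsets=[0, 3]))
    print("two middle offsets 1,2: ", pss_select(seq, k=5, s=2, offsets=[1, 2]))
    print("minimizers, k=3, w=3:   ", minimizer_select(seq, k=3, w=3))
```

Under these assumptions, changing the set of offsets changes how many k-mers are kept, which is one way to read the description's statement that scheme parameters affect compression; the exact parameter choices evaluated by the authors are not given in this record.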