Parameterized syncmer schemes improve long-read mapping
MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacter...
Main authors: | Dutta, Abhinav; Pellow, David; Shamir, Ron |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645665/ https://www.ncbi.nlm.nih.gov/pubmed/36306319 http://dx.doi.org/10.1371/journal.pcbi.1010638 |
_version_ | 1784827011811770368 |
---|---|
author | Dutta, Abhinav Pellow, David Shamir, Ron |
author_facet | Dutta, Abhinav Pellow, David Shamir, Ron |
author_sort | Dutta, Abhinav |
collection | PubMed |
description | MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. RESULTS: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long-read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings. |
format | Online Article Text |
id | pubmed-9645665 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-96456652022-11-15 Parameterized syncmer schemes improve long-read mapping Dutta, Abhinav Pellow, David Shamir, Ron PLoS Comput Biol Research Article MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. RESULTS: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long-read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings. 
Public Library of Science 2022-10-28 /pmc/articles/PMC9645665/ /pubmed/36306319 http://dx.doi.org/10.1371/journal.pcbi.1010638 Text en © 2022 Dutta et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Dutta, Abhinav Pellow, David Shamir, Ron Parameterized syncmer schemes improve long-read mapping |
title | Parameterized syncmer schemes improve long-read mapping |
title_full | Parameterized syncmer schemes improve long-read mapping |
title_fullStr | Parameterized syncmer schemes improve long-read mapping |
title_full_unstemmed | Parameterized syncmer schemes improve long-read mapping |
title_short | Parameterized syncmer schemes improve long-read mapping |
title_sort | parameterized syncmer schemes improve long-read mapping |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645665/ https://www.ncbi.nlm.nih.gov/pubmed/36306319 http://dx.doi.org/10.1371/journal.pcbi.1010638 |
work_keys_str_mv | AT duttaabhinav parameterizedsyncmerschemesimprovelongreadmapping AT pellowdavid parameterizedsyncmerschemesimprovelongreadmapping AT shamirron parameterizedsyncmerschemesimprovelongreadmapping |
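
The description field contrasts minimizer sketching with syncmer sketching but does not define either. As a rough illustration only, the sketch below assumes the commonly used definitions: a minimizer is the smallest k-mer in each window of w consecutive k-mers, and a syncmer-style scheme selects a k-mer when its smallest s-mer starts at one of a given set of offsets. The function names, the `offsets` parameter, and the plain lexicographic ordering are assumptions made for this example; they are not taken from the record or from the minimap2 or Winnowmap2 code, and real mappers typically order s-mers and k-mers by a hash rather than lexicographically.

```python
from typing import List, Set


def min_smer_offset(kmer: str, s: int) -> int:
    """Offset of the smallest s-mer inside kmer (leftmost on ties).

    Assumption: lexicographic order stands in for the hash order a mapper would use.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers))


def syncmer_select(seq: str, k: int, s: int, offsets: Set[int]) -> List[int]:
    """Start positions of k-mers whose smallest s-mer starts at one of `offsets`.

    offsets={0} gives an open-syncmer-like rule; offsets={0, k - s} gives a
    closed-syncmer-like rule; larger sets give a multi-parameter rule.
    """
    return [i for i in range(len(seq) - k + 1)
            if min_smer_offset(seq[i:i + k], s) in offsets]


def minimizer_select(seq: str, k: int, w: int) -> List[int]:
    """Start positions of the smallest k-mer in each window of w consecutive k-mers."""
    picked = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        kmers = [seq[j:j + k] for j in range(start, start + w)]
        picked.add(start + kmers.index(min(kmers)))
    return sorted(picked)


if __name__ == "__main__":
    seq = "ACGTACGTTGCA"  # toy sequence, not data from the article
    k, s, w = 5, 2, 3
    print("one offset   :", syncmer_select(seq, k, s, {0}))
    print("two offsets  :", syncmer_select(seq, k, s, {0, k - s}))
    print("minimizers   :", minimizer_select(seq, k, w))
```

Note the design difference this exposes: the syncmer-style rule decides on each k-mer from its own content alone, whereas the minimizer rule depends on the neighbouring k-mers in the window, which is one intuition for why a mutation nearby can change which k-mers a minimizer sketch keeps.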