
Entropy predicts sensitivity of pseudorandom seeds

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present....


Bibliographic Details
Main authors: Maier, Benjamin Dominik; Sahlin, Kristoffer
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects: Methods
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538493/
https://www.ncbi.nlm.nih.gov/pubmed/37217253
http://dx.doi.org/10.1101/gr.277645.123
collection PubMed
description Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and widely used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity even at high indel rates. However, that study did not explain why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. The seed randomness–sensitivity relationship we discovered explains why some seeds perform better than others and provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. Using both simulated and biological data, we show that our new seed constructs improve sequence-matching sensitivity over other strobemers and that they are useful for read mapping and ANI estimation. For read mapping, we implement strobemers in minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than with k-mers when mapping reads at high error rates. For ANI estimation, we find that higher-entropy seeds have a higher rank correlation between estimated and true ANI.
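The description above does not spell out how a pseudorandom seed is built. As a rough illustration only, the following sketch contrasts plain k-mers with order-2 randstrobes, where the first k-mer is fixed and the second is picked from a downstream window by minimising a hash. The function names, the choice of BLAKE2b as the hash, and the parameters `w_min` and `w_max` are assumptions made for this example, not the authors' implementation. Windows that would run past the end of the sequence are simply skipped, which is a simplification.

```python
import hashlib
import random


def _hash(s: str) -> int:
    """Deterministic 64-bit hash of a string (assumed stand-in for any good hash)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def kmers(seq: str, k: int) -> set:
    """All contiguous substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def randstrobes(seq: str, k: int, w_min: int, w_max: int) -> set:
    """Order-2 randstrobes (illustrative sketch).

    The first strobe starts at i; the second starts somewhere in
    [i + w_min, i + w_max] and is the candidate that minimises the hash of
    the concatenated pair. Positions without a full window are skipped.
    """
    seeds = set()
    last_start = len(seq) - k
    for i in range(last_start + 1):
        first = seq[i:i + k]
        lo, hi = i + w_min, min(i + w_max, last_start)
        if lo > hi:
            continue  # no room left for a second strobe
        j = min(range(lo, hi + 1), key=lambda p: _hash(first + seq[p:p + k]))
        seeds.add(first + seq[j:j + k])
    return seeds


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "".join(rng.choice("ACGT") for _ in range(200))
    read = ref[:100] + "A" + ref[100:]  # the same sequence with one inserted base
    print("shared k-mers (k=20):", len(kmers(ref, 20) & kmers(read, 20)))
    print("shared randstrobes (2x10):",
          len(randstrobes(ref, 10, 11, 30) & randstrobes(read, 10, 11, 30)))
```

Running the script prints how many seeds of each kind survive a single insertion. Because the second strobe's position varies with the hash, a randstrobe can span an indel that would break a contiguous k-mer of the same total length.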
format Online
Article
Text
id pubmed-10538493
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-10538493 2023-09-29 Entropy predicts sensitivity of pseudorandom seeds Maier, Benjamin Dominik Sahlin, Kristoffer Genome Res Methods
Cold Spring Harbor Laboratory Press 2023-07 /pmc/articles/PMC10538493/ /pubmed/37217253 http://dx.doi.org/10.1101/gr.277645.123 Text en © 2023 Maier and Sahlin; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
topic Methods