Cargando…
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing usef...
Autores principales: | , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2023
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/ https://www.ncbi.nlm.nih.gov/pubmed/37603771 http://dx.doi.org/10.1093/bioinformatics/btad512 |
_version_ | 1785106924731105280 |
---|---|
author | Kille, Bryce Garrison, Erik Treangen, Todd J Phillippy, Adam M |
author_facet | Kille, Bryce Garrison, Erik Treangen, Todd J Phillippy, Adam M |
author_sort | Kille, Bryce |
collection | PubMed |
description | MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap. |
format | Online Article Text |
id | pubmed-10505501 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-105055012023-09-18 Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation Kille, Bryce Garrison, Erik Treangen, Todd J Phillippy, Adam M Bioinformatics Original Paper MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap. Oxford University Press 2023-08-21 /pmc/articles/PMC10505501/ /pubmed/37603771 http://dx.doi.org/10.1093/bioinformatics/btad512 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Kille, Bryce Garrison, Erik Treangen, Todd J Phillippy, Adam M Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title | Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title_full | Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title_fullStr | Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title_full_unstemmed | Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title_short | Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation |
title_sort | minmers are a generalization of minimizers that enable unbiased local jaccard estimation |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/ https://www.ncbi.nlm.nih.gov/pubmed/37603771 http://dx.doi.org/10.1093/bioinformatics/btad512 |
work_keys_str_mv | AT killebryce minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation AT garrisonerik minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation AT treangentoddj minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation AT phillippyadamm minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation |