Cargando…

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing usef...

Descripción completa

Detalles Bibliográficos
Autores principales: Kille, Bryce, Garrison, Erik, Treangen, Todd J, Phillippy, Adam M
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/
https://www.ncbi.nlm.nih.gov/pubmed/37603771
http://dx.doi.org/10.1093/bioinformatics/btad512
_version_ 1785106924731105280
author Kille, Bryce
Garrison, Erik
Treangen, Todd J
Phillippy, Adam M
author_facet Kille, Bryce
Garrison, Erik
Treangen, Todd J
Phillippy, Adam M
author_sort Kille, Bryce
collection PubMed
description MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap.
format Online
Article
Text
id pubmed-10505501
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-105055012023-09-18 Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation Kille, Bryce Garrison, Erik Treangen, Todd J Phillippy, Adam M Bioinformatics Original Paper MOTIVATION: The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap. Oxford University Press 2023-08-21 /pmc/articles/PMC10505501/ /pubmed/37603771 http://dx.doi.org/10.1093/bioinformatics/btad512 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Paper
Kille, Bryce
Garrison, Erik
Treangen, Todd J
Phillippy, Adam M
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title_full Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title_fullStr Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title_full_unstemmed Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title_short Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
title_sort minmers are a generalization of minimizers that enable unbiased local jaccard estimation
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/
https://www.ncbi.nlm.nih.gov/pubmed/37603771
http://dx.doi.org/10.1093/bioinformatics/btad512
work_keys_str_mv AT killebryce minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation
AT garrisonerik minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation
AT treangentoddj minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation
AT phillippyadamm minmersareageneralizationofminimizersthatenableunbiasedlocaljaccardestimation