Cargando…

Locality-sensitive bucketing functions for the edit distance

BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissim...

Descripción completa

Detalles Bibliográficos
Autores principales: Chen, Ke, Shao, Mingfu
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367278/
https://www.ncbi.nlm.nih.gov/pubmed/37488600
http://dx.doi.org/10.1186/s13015-023-00234-2
_version_ 1785077354105667584
author Chen, Ke
Shao, Mingfu
author_facet Chen, Ke
Shao, Mingfu
author_sort Chen, Ke
collection PubMed
description BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text] -sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.
format Online
Article
Text
id pubmed-10367278
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-103672782023-07-26 Locality-sensitive bucketing functions for the edit distance Chen, Ke Shao, Mingfu Algorithms Mol Biol Research BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text] -sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions. BioMed Central 2023-07-24 /pmc/articles/PMC10367278/ /pubmed/37488600 http://dx.doi.org/10.1186/s13015-023-00234-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Chen, Ke
Shao, Mingfu
Locality-sensitive bucketing functions for the edit distance
title Locality-sensitive bucketing functions for the edit distance
title_full Locality-sensitive bucketing functions for the edit distance
title_fullStr Locality-sensitive bucketing functions for the edit distance
title_full_unstemmed Locality-sensitive bucketing functions for the edit distance
title_short Locality-sensitive bucketing functions for the edit distance
title_sort locality-sensitive bucketing functions for the edit distance
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367278/
https://www.ncbi.nlm.nih.gov/pubmed/37488600
http://dx.doi.org/10.1186/s13015-023-00234-2
work_keys_str_mv AT chenke localitysensitivebucketingfunctionsfortheeditdistance
AT shaomingfu localitysensitivebucketingfunctionsfortheeditdistance