Cargando…
Locality-sensitive bucketing functions for the edit distance
BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissim...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2023
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367278/ https://www.ncbi.nlm.nih.gov/pubmed/37488600 http://dx.doi.org/10.1186/s13015-023-00234-2 |
_version_ | 1785077354105667584 |
---|---|
author | Chen, Ke Shao, Mingfu |
author_facet | Chen, Ke Shao, Mingfu |
author_sort | Chen, Ke |
collection | PubMed |
description | BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text] -sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions. |
format | Online Article Text |
id | pubmed-10367278 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-103672782023-07-26 Locality-sensitive bucketing functions for the edit distance Chen, Ke Shao, Mingfu Algorithms Mol Biol Research BACKGROUND: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text] -sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions. BioMed Central 2023-07-24 /pmc/articles/PMC10367278/ /pubmed/37488600 http://dx.doi.org/10.1186/s13015-023-00234-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Chen, Ke Shao, Mingfu Locality-sensitive bucketing functions for the edit distance |
title | Locality-sensitive bucketing functions for the edit distance |
title_full | Locality-sensitive bucketing functions for the edit distance |
title_fullStr | Locality-sensitive bucketing functions for the edit distance |
title_full_unstemmed | Locality-sensitive bucketing functions for the edit distance |
title_short | Locality-sensitive bucketing functions for the edit distance |
title_sort | locality-sensitive bucketing functions for the edit distance |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367278/ https://www.ncbi.nlm.nih.gov/pubmed/37488600 http://dx.doi.org/10.1186/s13015-023-00234-2 |
work_keys_str_mv | AT chenke localitysensitivebucketingfunctionsfortheeditdistance AT shaomingfu localitysensitivebucketingfunctionsfortheeditdistance |