Cargando…

LexicHash: sequence similarity estimation via lexicographic comparison of hashes

MOTIVATION: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a...

Descripción completa

Detalles Bibliográficos
Autores principales:	Greenberg, Grant, Ravi, Aditya Narayan, Shomorony, Ilan
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2023
Materias:	Original Paper
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10628434/ https://www.ncbi.nlm.nih.gov/pubmed/37878809 http://dx.doi.org/10.1093/bioinformatics/btad652

_version_	1785131757623836672
author	Greenberg, Grant Ravi, Aditya Narayan Shomorony, Ilan
author_facet	Greenberg, Grant Ravi, Aditya Narayan Shomorony, Ilan
author_sort	Greenberg, Grant
collection	PubMed
description	MOTIVATION: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. RESULTS: In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an [Formula: see text] time algorithm, and circumventing the seemingly fundamental [Formula: see text] scaling associated with pairwise similarity search. AVAILABILITY AND IMPLEMENTATION: LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
format	Online Article Text
id	pubmed-10628434
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-106284342023-11-08 LexicHash: sequence similarity estimation via lexicographic comparison of hashes Greenberg, Grant Ravi, Aditya Narayan Shomorony, Ilan Bioinformatics Original Paper MOTIVATION: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. RESULTS: In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an [Formula: see text] time algorithm, and circumventing the seemingly fundamental [Formula: see text] scaling associated with pairwise similarity search. AVAILABILITY AND IMPLEMENTATION: LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash. Oxford University Press 2023-10-25 /pmc/articles/PMC10628434/ /pubmed/37878809 http://dx.doi.org/10.1093/bioinformatics/btad652 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Greenberg, Grant Ravi, Aditya Narayan Shomorony, Ilan LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title	LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title_full	LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title_fullStr	LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title_full_unstemmed	LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title_short	LexicHash: sequence similarity estimation via lexicographic comparison of hashes
title_sort	lexichash: sequence similarity estimation via lexicographic comparison of hashes
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10628434/ https://www.ncbi.nlm.nih.gov/pubmed/37878809 http://dx.doi.org/10.1093/bioinformatics/btad652
work_keys_str_mv	AT greenberggrant lexichashsequencesimilarityestimationvialexicographiccomparisonofhashes AT raviadityanarayan lexichashsequencesimilarityestimationvialexicographiccomparisonofhashes AT shomoronyilan lexichashsequencesimilarityestimationvialexicographiccomparisonofhashes

LexicHash: sequence similarity estimation via lexicographic comparison of hashes

Ejemplares similares