ntHash2: recursive spaced seed hashing for nucleotide sequences

MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various...

Bibliographic Details
Main authors: Kazemi, Parham; Wong, Johnathan; Nikolić, Vladimir; Mohamadi, Hamid; Warren, René L; Birol, Inanç
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563681/
https://www.ncbi.nlm.nih.gov/pubmed/36000872
http://dx.doi.org/10.1093/bioinformatics/btac564
_version_ 1784808462243332096
author Kazemi, Parham
Wong, Johnathan
Nikolić, Vladimir
Mohamadi, Hamid
Warren, René L
Birol, Inanç
author_facet Kazemi, Parham
Wong, Johnathan
Nikolić, Vladimir
Mohamadi, Hamid
Warren, René L
Birol, Inanç
author_sort Kazemi, Parham
collection PubMed
description MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9563681
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9563681 2022-10-18 ntHash2: recursive spaced seed hashing for nucleotide sequences Kazemi, Parham Wong, Johnathan Nikolić, Vladimir Mohamadi, Hamid Warren, René L Birol, Inanç Bioinformatics Applications Notes MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-08-24 /pmc/articles/PMC9563681/ /pubmed/36000872 http://dx.doi.org/10.1093/bioinformatics/btac564 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Applications Notes
Kazemi, Parham
Wong, Johnathan
Nikolić, Vladimir
Mohamadi, Hamid
Warren, René L
Birol, Inanç
ntHash2: recursive spaced seed hashing for nucleotide sequences
title ntHash2: recursive spaced seed hashing for nucleotide sequences
topic Applications Notes
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563681/
https://www.ncbi.nlm.nih.gov/pubmed/36000872
http://dx.doi.org/10.1093/bioinformatics/btac564