ntHash2: recursive spaced seed hashing for nucleotide sequences
MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various...
Main authors: | Kazemi, Parham; Wong, Johnathan; Nikolić, Vladimir; Mohamadi, Hamid; Warren, René L; Birol, Inanç |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2022 |
Subjects: | Applications Notes |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563681/ https://www.ncbi.nlm.nih.gov/pubmed/36000872 http://dx.doi.org/10.1093/bioinformatics/btac564 |
_version_ | 1784808462243332096 |
---|---|
author | Kazemi, Parham Wong, Johnathan Nikolić, Vladimir Mohamadi, Hamid Warren, René L Birol, Inanç |
author_facet | Kazemi, Parham Wong, Johnathan Nikolić, Vladimir Mohamadi, Hamid Warren, René L Birol, Inanç |
author_sort | Kazemi, Parham |
collection | PubMed |
description | MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9563681 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-95636812022-10-18 ntHash2: recursive spaced seed hashing for nucleotide sequences Kazemi, Parham Wong, Johnathan Nikolić, Vladimir Mohamadi, Hamid Warren, René L Birol, Inanç Bioinformatics Applications Notes MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-08-24 /pmc/articles/PMC9563681/ /pubmed/36000872 http://dx.doi.org/10.1093/bioinformatics/btac564 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Applications Notes Kazemi, Parham Wong, Johnathan Nikolić, Vladimir Mohamadi, Hamid Warren, René L Birol, Inanç ntHash2: recursive spaced seed hashing for nucleotide sequences |
title | ntHash2: recursive spaced seed hashing for nucleotide sequences |
title_full | ntHash2: recursive spaced seed hashing for nucleotide sequences |
title_fullStr | ntHash2: recursive spaced seed hashing for nucleotide sequences |
title_full_unstemmed | ntHash2: recursive spaced seed hashing for nucleotide sequences |
title_short | ntHash2: recursive spaced seed hashing for nucleotide sequences |
title_sort | nthash2: recursive spaced seed hashing for nucleotide sequences |
topic | Applications Notes |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563681/ https://www.ncbi.nlm.nih.gov/pubmed/36000872 http://dx.doi.org/10.1093/bioinformatics/btac564 |
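The description in this record contrasts recursive (rolling) hashing of spaced seeds with a naïve adaptation of conventional hashing. The Python sketch below illustrates that general idea only; it is not the ntHash2 algorithm or its API. The per-base constants, the function names, and the `1`/`0` pattern notation (`1` = position is hashed, `0` = position is ignored) are all assumptions made for illustration. It rolls a cyclic-polynomial k-mer hash along the sequence and then XORs out the ignored positions, and the result can be checked against hashing each window from scratch.

```python
MASK64 = (1 << 64) - 1

# Arbitrary 64-bit constants chosen for illustration (assumption);
# these are NOT claimed to be the seed values used by ntHash.
BASE_SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0x27D4EB2F165667C5,
}


def rotl(x, r):
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def kmer_hash(kmer):
    """Hash a whole k-mer from scratch: XOR of rotated per-base seeds."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rotl(BASE_SEED[base], k - 1 - i)
    return h


def rolling_kmer_hashes(seq, k):
    """Hash every k-mer of seq, updating the previous hash in O(1) per step."""
    if len(seq) < k:
        return []
    h = kmer_hash(seq[:k])
    out = [h]
    for i in range(1, len(seq) - k + 1):
        # Rotate the previous hash, cancel the base that left the window,
        # and add the base that entered it.
        h = rotl(h, 1) ^ rotl(BASE_SEED[seq[i - 1]], k) ^ BASE_SEED[seq[i + k - 1]]
        out.append(h)
    return out


def spaced_seed_hash(window, pattern):
    """Naive spaced seed hash: rehash only the '1' positions of one window."""
    k = len(pattern)
    h = 0
    for i, (base, care) in enumerate(zip(window, pattern)):
        if care == "1":
            h ^= rotl(BASE_SEED[base], k - 1 - i)
    return h


def spaced_seed_hashes(seq, pattern):
    """Spaced seed hashes derived from the rolling k-mer hash.

    Each window starts from the rolled k-mer hash and XORs out the
    contribution of the '0' (ignored) positions.
    """
    k = len(pattern)
    ignored = [i for i, c in enumerate(pattern) if c == "0"]
    out = []
    for start, h in enumerate(rolling_kmer_hashes(seq, k)):
        for i in ignored:
            h ^= rotl(BASE_SEED[seq[start + i]], k - 1 - i)
        out.append(h)
    return out


if __name__ == "__main__":
    pattern = "1101011"
    a = "ACGTACG"
    b = "ACTTACG"  # differs from a only at an ignored position (index 2)
    print(spaced_seed_hash(a, pattern) == spaced_seed_hash(b, pattern))
    print(kmer_hash(a) == kmer_hash(b))
```

In this sketch, two windows that differ only at an ignored position share a spaced seed hash while their plain k-mer hashes differ, which is the mismatch tolerance the description attributes to spaced seeds. Deriving the value from the rolled hash pays off when the pattern has few ignored positions; the approach used by ntHash2 itself is described in the linked article.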