
TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash

OBJECTIVES: Genomic signatures such as k-mers have become one of the most prominent ways to describe genomic data. As a result, myriad real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. In other words, a...


Bibliographic Details
Main authors: Ju, Chelsea J.-T., Jiang, Jyun-Yu, Li, Ruirui, Li, Zeyu, Wang, Wei
Format: Online Article Text
Language: English
Published: De Gruyter 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9027990/
https://www.ncbi.nlm.nih.gov/pubmed/35881666
http://dx.doi.org/10.1515/mr-2021-0016
collection PubMed
description OBJECTIVES: Genomic signatures such as k-mers have become one of the most prominent ways to describe genomic data. As a result, myriad real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. An efficient approach to genomic signature profiling is therefore essential for handling high-throughput sequencing reads. However, most existing approaches recognize only fixed-size k-mers, although many studies have shown the importance of considering variable-length k-mers. METHODS: In this paper, we present a novel genomic signature profiling approach, TahcoRoll, which extends the Aho–Corasick (AC) algorithm to the task of profiling variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is then used to encode signatures and read patterns for efficient matching. RESULTS: In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and can process reads from different sequencing platforms on a budget desktop computer. CONCLUSIONS: The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art counter JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by a factor of at least four.
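The two encoding steps named in the METHODS text (one bit per nucleotide cluster, then a rolling hash over read windows) can be illustrated with a minimal sketch. It is not the TahcoRoll implementation: it contains no Aho–Corasick automaton and substitutes a plain dictionary lookup for it. The grouping of A and G versus C and T is an assumption, since the abstract does not say which nucleotides share a cluster, and the names `BIT`, `binary_hash`, and `count_signatures` are hypothetical.

```python
from collections import Counter

# Assumption: purines (A, G) map to 0 and pyrimidines (C, T) map to 1.
# The abstract does not state which grouping TahcoRoll actually uses.
BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def binary_hash(seq):
    """Pack the 1-bit codes of seq into an integer (first base = highest bit)."""
    h = 0
    for base in seq:
        h = (h << 1) | BIT[base]
    return h


def count_signatures(read, signatures):
    """Count occurrences of variable-length signatures in one read.

    For each distinct signature length k, a k-bit window hash is rolled
    along the read. The 1-bit code is lossy, so a hash hit is only a
    candidate and is confirmed against the actual nucleotides.
    """
    # Index signatures by (length, binary hash) for fast candidate lookup.
    index = {}
    for sig in signatures:
        index.setdefault((len(sig), binary_hash(sig)), set()).add(sig)

    counts = Counter()
    for k in sorted({len(s) for s in signatures}):
        if k > len(read):
            continue
        mask = (1 << k) - 1
        h = binary_hash(read[:k])
        for start in range(len(read) - k + 1):
            if start > 0:
                # Roll: shift in the new base's bit, drop the oldest via the mask.
                h = ((h << 1) | BIT[read[start + k - 1]]) & mask
            window = read[start:start + k]
            if window in index.get((k, h), ()):
                counts[window] += 1
    return counts


counts = count_signatures("ACGTACGA", ["ACG", "GTAC", "TTT"])
print(dict(counts))
```

In this example `ACG` and the read window `GTA` share the same 3-bit code, which is why each hash hit is checked against the real sequence before it is counted. Rolling a separate window per signature length is a simplification made here for clarity.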
id pubmed-9027990
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9027990 2022-05-25 TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash Ju, Chelsea J.-T. Jiang, Jyun-Yu Li, Ruirui Li, Zeyu Wang, Wei Med Rev (Berl) Research Article De Gruyter 2022-02-14 /pmc/articles/PMC9027990/ /pubmed/35881666 http://dx.doi.org/10.1515/mr-2021-0016 Text en © 2021 Chelsea J.-T. Ju et al., published by De Gruyter, Berlin/Boston https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
topic Research Article