TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash
OBJECTIVES: Genomic signatures such as k-mers have become one of the most prominent ways to describe genomic data. As a result, many real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. In other words, a...
Main authors: | Ju, Chelsea J.-T.; Jiang, Jyun-Yu; Li, Ruirui; Li, Zeyu; Wang, Wei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | De Gruyter, 2022 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9027990/ https://www.ncbi.nlm.nih.gov/pubmed/35881666 http://dx.doi.org/10.1515/mr-2021-0016 |
_version_ | 1784691505666981888 |
---|---|
author | Ju, Chelsea J.-T. Jiang, Jyun-Yu Li, Ruirui Li, Zeyu Wang, Wei |
author_facet | Ju, Chelsea J.-T. Jiang, Jyun-Yu Li, Ruirui Li, Zeyu Wang, Wei |
author_sort | Ju, Chelsea J.-T. |
collection | PubMed |
description | OBJECTIVES: Genomic signatures such as k-mers have become one of the most prominent ways to describe genomic data. As a result, many real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. An efficient approach to genomic signature profiling is therefore essential for handling high-throughput sequencing reads. However, most existing approaches recognize only fixed-size k-mers, although many studies have shown the importance of considering variable-length k-mers. METHODS: In this paper, we present a novel genomic signature profiling approach, TahcoRoll, which extends the Aho–Corasick algorithm (AC) to profile variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is then used to encode signatures and read patterns for efficient matching. RESULTS: In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and can process reads from different sequencing platforms on a budget desktop computer. CONCLUSIONS: The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art tool JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by a factor of at least four. |
format | Online Article Text |
id | pubmed-9027990 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | De Gruyter |
record_format | MEDLINE/PubMed |
spelling | pubmed-9027990 2022-05-25 TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash Ju, Chelsea J.-T. Jiang, Jyun-Yu Li, Ruirui Li, Zeyu Wang, Wei Med Rev (Berl) Research Article OBJECTIVES: Genomic signatures such as k-mers have become one of the most prominent ways to describe genomic data. As a result, many real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. An efficient approach to genomic signature profiling is therefore essential for handling high-throughput sequencing reads. However, most existing approaches recognize only fixed-size k-mers, although many studies have shown the importance of considering variable-length k-mers. METHODS: In this paper, we present a novel genomic signature profiling approach, TahcoRoll, which extends the Aho–Corasick algorithm (AC) to profile variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is then used to encode signatures and read patterns for efficient matching. RESULTS: In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and can process reads from different sequencing platforms on a budget desktop computer. CONCLUSIONS: The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art tool JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by a factor of at least four. De Gruyter 2022-02-14 /pmc/articles/PMC9027990/ /pubmed/35881666 http://dx.doi.org/10.1515/mr-2021-0016 Text en © 2021 Chelsea J.-T. Ju et al., published by De Gruyter, Berlin/Boston https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. |
spellingShingle | Research Article Ju, Chelsea J.-T. Jiang, Jyun-Yu Li, Ruirui Li, Zeyu Wang, Wei TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title | TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title_full | TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title_fullStr | TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title_full_unstemmed | TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title_short | TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash |
title_sort | tahcoroll: fast genomic signature profiling via thinned automaton and rolling hash |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9027990/ https://www.ncbi.nlm.nih.gov/pubmed/35881666 http://dx.doi.org/10.1515/mr-2021-0016 |
work_keys_str_mv | AT juchelseajt tahcorollfastgenomicsignatureprofilingviathinnedautomatonandrollinghash AT jiangjyunyu tahcorollfastgenomicsignatureprofilingviathinnedautomatonandrollinghash AT liruirui tahcorollfastgenomicsignatureprofilingviathinnedautomatonandrollinghash AT lizeyu tahcorollfastgenomicsignatureprofilingviathinnedautomatonandrollinghash AT wangwei tahcorollfastgenomicsignatureprofilingviathinnedautomatonandrollinghash |
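
The METHODS part of the description names two ingredients: collapsing nucleotides into two one-bit clusters, and a rolling hash for matching signatures against reads. The Python sketch below illustrates both ideas in isolation. It is not the authors' implementation and it omits the thinned Aho–Corasick automaton entirely; instead it rolls one hash per distinct signature length. The cluster assignment (`A`, `C` to 0 and `G`, `T` to 1), the 2-bit codes, and the function names `binarize`, `encode`, and `profile` are all assumptions made for illustration, and the input is assumed to contain only `A`, `C`, `G`, `T`.

```python
from collections import Counter

# Hypothetical two-cluster grouping: the description does not say which
# nucleotides share a cluster, so this assignment is an assumption.
CLUSTER_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}

# 2-bit codes used by the rolling hash (also an illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def binarize(seq):
    """Collapse a nucleotide string to one bit per base."""
    return "".join(str(CLUSTER_BIT[base]) for base in seq)


def encode(kmer):
    """Base-4 value of a k-mer; equal-length k-mers never collide."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def profile(read, signatures):
    """Count occurrences of variable-length signatures in one read."""
    # Group signature hashes by length so each length gets its own window.
    by_length = {}
    for sig in signatures:
        by_length.setdefault(len(sig), {})[encode(sig)] = sig

    counts = Counter()
    for k, table in by_length.items():
        if k > len(read):
            continue
        top = 4 ** (k - 1)          # weight of the leftmost base in a window
        h = encode(read[:k])        # hash of the first window
        if h in table:
            counts[table[h]] += 1
        for i in range(k, len(read)):
            # Roll: drop the leftmost base, shift by one base, append the new one.
            h = (h - CODE[read[i - k]] * top) * 4 + CODE[read[i]]
            if h in table:
                counts[table[h]] += 1
    return counts
```

For example, `profile("ACGTACGT", ["ACG", "GT", "TACGT"])` counts signatures of three different lengths in one read. Each window update costs constant time per signature length; according to the description, TahcoRoll avoids scanning once per length by extending the Aho–Corasick algorithm, which this sketch does not attempt.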