Cargando…

Generalized enhanced suffix array construction in external memory

BACKGROUND: Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure ex...

Descripción completa

Detalles Bibliográficos
Autores principales: Louza, Felipe A., Telles, Guilherme P., Hoffmann, Steve, Ciferri, Cristina D. A.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5719966/
https://www.ncbi.nlm.nih.gov/pubmed/29234460
http://dx.doi.org/10.1186/s13015-017-0117-9
_version_ 1783284589969539072
author Louza, Felipe A.
Telles, Guilherme P.
Hoffmann, Steve
Ciferri, Cristina D. A.
author_facet Louza, Felipe A.
Telles, Guilherme P.
Hoffmann, Steve
Ciferri, Cristina D. A.
author_sort Louza, Felipe A.
collection PubMed
description BACKGROUND: Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. RESULTS: In this article we present and analyze [Formula: see text] [introduced in CPM (External memory generalized suffix and [Formula: see text] arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, [Formula: see text] showed a competitive performance when compared to [Formula: see text] and [Formula: see text] , which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. CONCLUSIONS: The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.
format Online
Article
Text
id pubmed-5719966
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-57199662017-12-11 Generalized enhanced suffix array construction in external memory Louza, Felipe A. Telles, Guilherme P. Hoffmann, Steve Ciferri, Cristina D. A. Algorithms Mol Biol Research BACKGROUND: Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. RESULTS: In this article we present and analyze [Formula: see text] [introduced in CPM (External memory generalized suffix and [Formula: see text] arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, [Formula: see text] showed a competitive performance when compared to [Formula: see text] and [Formula: see text] , which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. CONCLUSIONS: The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time. BioMed Central 2017-12-07 /pmc/articles/PMC5719966/ /pubmed/29234460 http://dx.doi.org/10.1186/s13015-017-0117-9 Text en © The Author(s) 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Louza, Felipe A.
Telles, Guilherme P.
Hoffmann, Steve
Ciferri, Cristina D. A.
Generalized enhanced suffix array construction in external memory
title Generalized enhanced suffix array construction in external memory
title_full Generalized enhanced suffix array construction in external memory
title_fullStr Generalized enhanced suffix array construction in external memory
title_full_unstemmed Generalized enhanced suffix array construction in external memory
title_short Generalized enhanced suffix array construction in external memory
title_sort generalized enhanced suffix array construction in external memory
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5719966/
https://www.ncbi.nlm.nih.gov/pubmed/29234460
http://dx.doi.org/10.1186/s13015-017-0117-9
work_keys_str_mv AT louzafelipea generalizedenhancedsuffixarrayconstructioninexternalmemory
AT tellesguilhermep generalizedenhancedsuffixarrayconstructioninexternalmemory
AT hoffmannsteve generalizedenhancedsuffixarrayconstructioninexternalmemory
AT ciferricristinada generalizedenhancedsuffixarrayconstructioninexternalmemory