Cargando…

External memory BWT and LCP computation for sequence collections with applications

BACKGROUND: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the...

Descripción completa

Detalles Bibliográficos
Autores principales: Egidi, Lavinia, Louza, Felipe A., Manzini, Giovanni, Telles, Guilherme P.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6408864/
https://www.ncbi.nlm.nih.gov/pubmed/30899322
http://dx.doi.org/10.1186/s13015-019-0140-0
_version_ 1783401871733424128
author Egidi, Lavinia
Louza, Felipe A.
Manzini, Giovanni
Telles, Guilherme P.
author_facet Egidi, Lavinia
Louza, Felipe A.
Manzini, Giovanni
Telles, Guilherme P.
author_sort Egidi, Lavinia
collection PubMed
description BACKGROUND: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. RESULTS: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. CONCLUSIONS: We prove that our algorithm performs [Formula: see text] sequential I/Os, where n is the total length of the collection and [Formula: see text] is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
format Online
Article
Text
id pubmed-6408864
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-64088642019-03-21 External memory BWT and LCP computation for sequence collections with applications Egidi, Lavinia Louza, Felipe A. Manzini, Giovanni Telles, Guilherme P. Algorithms Mol Biol Research BACKGROUND: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. RESULTS: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. CONCLUSIONS: We prove that our algorithm performs [Formula: see text] sequential I/Os, where n is the total length of the collection and [Formula: see text] is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input. BioMed Central 2019-03-08 /pmc/articles/PMC6408864/ /pubmed/30899322 http://dx.doi.org/10.1186/s13015-019-0140-0 Text en © The Author(s) 2019 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Egidi, Lavinia
Louza, Felipe A.
Manzini, Giovanni
Telles, Guilherme P.
External memory BWT and LCP computation for sequence collections with applications
title External memory BWT and LCP computation for sequence collections with applications
title_full External memory BWT and LCP computation for sequence collections with applications
title_fullStr External memory BWT and LCP computation for sequence collections with applications
title_full_unstemmed External memory BWT and LCP computation for sequence collections with applications
title_short External memory BWT and LCP computation for sequence collections with applications
title_sort external memory bwt and lcp computation for sequence collections with applications
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6408864/
https://www.ncbi.nlm.nih.gov/pubmed/30899322
http://dx.doi.org/10.1186/s13015-019-0140-0
work_keys_str_mv AT egidilavinia externalmemorybwtandlcpcomputationforsequencecollectionswithapplications
AT louzafelipea externalmemorybwtandlcpcomputationforsequencecollectionswithapplications
AT manzinigiovanni externalmemorybwtandlcpcomputationforsequencecollectionswithapplications
AT tellesguilhermep externalmemorybwtandlcpcomputationforsequencecollectionswithapplications