Cargando…

Prefix-free parsing for building big BWTs

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic...

Descripción completa

Detalles Bibliográficos
Autores principales: Boucher, Christina, Gagie, Travis, Kuhnle, Alan, Langmead, Ben, Manzini, Giovanni, Mun, Taher
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6534911/
https://www.ncbi.nlm.nih.gov/pubmed/31149025
http://dx.doi.org/10.1186/s13015-019-0148-5
_version_ 1783421508276715520
author Boucher, Christina
Gagie, Travis
Kuhnle, Alan
Langmead, Ben
Manzini, Giovanni
Mun, Taher
author_facet Boucher, Christina
Gagie, Travis
Kuhnle, Alan
Langmead, Ben
Manzini, Giovanni
Mun, Taher
author_sort Boucher, Christina
collection PubMed
description High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21  GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
format Online
Article
Text
id pubmed-6534911
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-65349112019-05-30 Prefix-free parsing for building big BWTs Boucher, Christina Gagie, Travis Kuhnle, Alan Langmead, Ben Manzini, Giovanni Mun, Taher Algorithms Mol Biol Research High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21  GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory. BioMed Central 2019-05-24 /pmc/articles/PMC6534911/ /pubmed/31149025 http://dx.doi.org/10.1186/s13015-019-0148-5 Text en © The Author(s) 2019 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Boucher, Christina
Gagie, Travis
Kuhnle, Alan
Langmead, Ben
Manzini, Giovanni
Mun, Taher
Prefix-free parsing for building big BWTs
title Prefix-free parsing for building big BWTs
title_full Prefix-free parsing for building big BWTs
title_fullStr Prefix-free parsing for building big BWTs
title_full_unstemmed Prefix-free parsing for building big BWTs
title_short Prefix-free parsing for building big BWTs
title_sort prefix-free parsing for building big bwts
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6534911/
https://www.ncbi.nlm.nih.gov/pubmed/31149025
http://dx.doi.org/10.1186/s13015-019-0148-5
work_keys_str_mv AT boucherchristina prefixfreeparsingforbuildingbigbwts
AT gagietravis prefixfreeparsingforbuildingbigbwts
AT kuhnlealan prefixfreeparsingforbuildingbigbwts
AT langmeadben prefixfreeparsingforbuildingbigbwts
AT manzinigiovanni prefixfreeparsingforbuildingbigbwts
AT muntaher prefixfreeparsingforbuildingbigbwts