Cargando…

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Whee...

Descripción completa

Detalles Bibliográficos
Autores principales: Kuhnle, Alan, Mun, Taher, Boucher, Christina, Gagie, Travis, Langmead, Ben, Manzini, Giovanni
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Mary Ann Liebert, Inc., publishers 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185338/
https://www.ncbi.nlm.nih.gov/pubmed/32181684
http://dx.doi.org/10.1089/cmb.2019.0309
_version_ 1783526741365489664
author Kuhnle, Alan
Mun, Taher
Boucher, Christina
Gagie, Travis
Langmead, Ben
Manzini, Giovanni
author_facet Kuhnle, Alan
Mun, Taher
Boucher, Christina
Gagie, Travis
Langmead, Ben
Manzini, Giovanni
author_sort Kuhnle, Alan
collection PubMed
description Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
format Online
Article
Text
id pubmed-7185338
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Mary Ann Liebert, Inc., publishers
record_format MEDLINE/PubMed
spelling pubmed-71853382020-05-06 Efficient Construction of a Complete Index for Pan-Genomics Read Alignment Kuhnle, Alan Mun, Taher Boucher, Christina Gagie, Travis Langmead, Ben Manzini, Giovanni J Comput Biol Research Articles Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing. Mary Ann Liebert, Inc., publishers 2020-04-01 2020-04-08 /pmc/articles/PMC7185338/ /pubmed/32181684 http://dx.doi.org/10.1089/cmb.2019.0309 Text en © Alan Kuhnle, et al., 2020. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle Research Articles
Kuhnle, Alan
Mun, Taher
Boucher, Christina
Gagie, Travis
Langmead, Ben
Manzini, Giovanni
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_full Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_fullStr Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_full_unstemmed Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_short Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_sort efficient construction of a complete index for pan-genomics read alignment
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185338/
https://www.ncbi.nlm.nih.gov/pubmed/32181684
http://dx.doi.org/10.1089/cmb.2019.0309
work_keys_str_mv AT kuhnlealan efficientconstructionofacompleteindexforpangenomicsreadalignment
AT muntaher efficientconstructionofacompleteindexforpangenomicsreadalignment
AT boucherchristina efficientconstructionofacompleteindexforpangenomicsreadalignment
AT gagietravis efficientconstructionofacompleteindexforpangenomicsreadalignment
AT langmeadben efficientconstructionofacompleteindexforpangenomicsreadalignment
AT manzinigiovanni efficientconstructionofacompleteindexforpangenomicsreadalignment