Cargando…

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Whee...

Descripción completa

Detalles Bibliográficos
Autores principales:	Kuhnle, Alan, Mun, Taher, Boucher, Christina, Gagie, Travis, Langmead, Ben, Manzini, Giovanni
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Mary Ann Liebert, Inc., publishers 2020
Materias:	Research Articles
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185338/ https://www.ncbi.nlm.nih.gov/pubmed/32181684 http://dx.doi.org/10.1089/cmb.2019.0309

_version_	1783526741365489664
author	Kuhnle, Alan Mun, Taher Boucher, Christina Gagie, Travis Langmead, Ben Manzini, Giovanni
author_facet	Kuhnle, Alan Mun, Taher Boucher, Christina Gagie, Travis Langmead, Ben Manzini, Giovanni
author_sort	Kuhnle, Alan
collection	PubMed
description	Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
format	Online Article Text
id	pubmed-7185338
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Mary Ann Liebert, Inc., publishers
record_format	MEDLINE/PubMed
spelling	pubmed-71853382020-05-06 Efficient Construction of a Complete Index for Pan-Genomics Read Alignment Kuhnle, Alan Mun, Taher Boucher, Christina Gagie, Travis Langmead, Ben Manzini, Giovanni J Comput Biol Research Articles Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing. Mary Ann Liebert, Inc., publishers 2020-04-01 2020-04-08 /pmc/articles/PMC7185338/ /pubmed/32181684 http://dx.doi.org/10.1089/cmb.2019.0309 Text en © Alan Kuhnle, et al., 2020. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle	Research Articles Kuhnle, Alan Mun, Taher Boucher, Christina Gagie, Travis Langmead, Ben Manzini, Giovanni Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title	Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_full	Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_fullStr	Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_full_unstemmed	Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_short	Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
title_sort	efficient construction of a complete index for pan-genomics read alignment
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185338/ https://www.ncbi.nlm.nih.gov/pubmed/32181684 http://dx.doi.org/10.1089/cmb.2019.0309
work_keys_str_mv	AT kuhnlealan efficientconstructionofacompleteindexforpangenomicsreadalignment AT muntaher efficientconstructionofacompleteindexforpangenomicsreadalignment AT boucherchristina efficientconstructionofacompleteindexforpangenomicsreadalignment AT gagietravis efficientconstructionofacompleteindexforpangenomicsreadalignment AT langmeadben efficientconstructionofacompleteindexforpangenomicsreadalignment AT manzinigiovanni efficientconstructionofacompleteindexforpangenomicsreadalignment

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Ejemplares similares