Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment

Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes a...

Bibliographic Details
Main Authors:	Maarala, Altti Ilari; Arasalo, Ossi; Valenzuela, Daniel; Mäkinen, Veli; Heljanko, Keijo
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8330939/ https://www.ncbi.nlm.nih.gov/pubmed/34343181 http://dx.doi.org/10.1371/journal.pone.0255260

author	Maarala, Altti Ilari; Arasalo, Ossi; Valenzuela, Daniel; Mäkinen, Veli; Heljanko, Keijo
author_sort	Maarala, Altti Ilari
collection	PubMed
description	Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. 
Compressing the GenBank database (n = 13,375,031; 488 GB) into a BLAST index resulted in a CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
format	Online Article Text
id	pubmed-8330939
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-8330939 2021-08-04 Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment Maarala, Altti Ilari; Arasalo, Ossi; Valenzuela, Daniel; Mäkinen, Veli; Heljanko, Keijo PLoS One Research Article Public Library of Science 2021-08-03 /pmc/articles/PMC8330939/ /pubmed/34343181 http://dx.doi.org/10.1371/journal.pone.0255260 Text en © 2021 Maarala et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8330939/ https://www.ncbi.nlm.nih.gov/pubmed/34343181 http://dx.doi.org/10.1371/journal.pone.0255260
