Cloud-native distributed genomic pileup operations

MOTIVATION: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentiall...

Bibliographic Details
Main Authors:	Wiewiórka, Marek, Szmurło, Agnieszka, Stankiewicz, Paweł, Gambin, Tomasz
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9848050/ https://www.ncbi.nlm.nih.gov/pubmed/36515465 http://dx.doi.org/10.1093/bioinformatics/btac804

_version_	1784871617154777088
author	Wiewiórka, Marek Szmurło, Agnieszka Stankiewicz, Paweł Gambin, Tomasz
author_facet	Wiewiórka, Marek Szmurło, Agnieszka Stankiewicz, Paweł Gambin, Tomasz
author_sort	Wiewiórka, Marek
collection	PubMed
description	MOTIVATION: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. RESULTS: Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way. AVAILABILITY AND IMPLEMENTATION: https://biodatageeks.github.io/sequila/
format	Online Article Text
id	pubmed-9848050
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-98480502023-01-20 Cloud-native distributed genomic pileup operations Wiewiórka, Marek Szmurło, Agnieszka Stankiewicz, Paweł Gambin, Tomasz Bioinformatics Original Paper MOTIVATION: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. RESULTS: Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way. 
AVAILABILITY AND IMPLEMENTATION: https://biodatageeks.github.io/sequila/ Oxford University Press 2022-12-14 /pmc/articles/PMC9848050/ /pubmed/36515465 http://dx.doi.org/10.1093/bioinformatics/btac804 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Wiewiórka, Marek Szmurło, Agnieszka Stankiewicz, Paweł Gambin, Tomasz Cloud-native distributed genomic pileup operations
title	Cloud-native distributed genomic pileup operations
title_full	Cloud-native distributed genomic pileup operations
title_fullStr	Cloud-native distributed genomic pileup operations
title_full_unstemmed	Cloud-native distributed genomic pileup operations
title_short	Cloud-native distributed genomic pileup operations
title_sort	cloud-native distributed genomic pileup operations
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9848050/ https://www.ncbi.nlm.nih.gov/pubmed/36515465 http://dx.doi.org/10.1093/bioinformatics/btac804
work_keys_str_mv	AT wiewiorkamarek cloudnativedistributedgenomicpileupoperations AT szmurłoagnieszka cloudnativedistributedgenomicpileupoperations AT stankiewiczpaweł cloudnativedistributedgenomicpileupoperations AT gambintomasz cloudnativedistributedgenomicpileupoperations
