VC@Scale: Scalable and high-performance variant calling on cluster environments

BACKGROUND: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate compared with conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational costs. Therefore, there is a need for mo...

Bibliographic Details
Main Authors:	Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Technical Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8424057/ https://www.ncbi.nlm.nih.gov/pubmed/34494101 http://dx.doi.org/10.1093/gigascience/giab057

_version_	1783749592078090240
author	Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter
author_facet	Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter
author_sort	Ahmad, Tanveer
collection	PubMed
description	BACKGROUND: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate compared with conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational costs. Therefore, there is a need for more scalable and higher performance workflows of these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark just for distributing/scheduling data among loosely coupled applications or using I/O-based storage for storing the output of intermediate applications does not exploit the full benefit of Apache Spark in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow’s columnar in-memory data transformations. RESULTS: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant for both CPU-only and CPU + GPU clusters. CONCLUSIONS: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open sourced; see https://github.com/abs-tudelft/variant-calling-at-scale.
format	Online Article Text
id	pubmed-8424057
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8424057 2021-09-09 VC@Scale: Scalable and high-performance variant calling on cluster environments Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter Gigascience Technical Note BACKGROUND: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate compared with conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational costs. Therefore, there is a need for more scalable and higher performance workflows of these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark just for distributing/scheduling data among loosely coupled applications or using I/O-based storage for storing the output of intermediate applications does not exploit the full benefit of Apache Spark in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow’s columnar in-memory data transformations. RESULTS: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant for both CPU-only and CPU + GPU clusters. 
CONCLUSIONS: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open sourced; see https://github.com/abs-tudelft/variant-calling-at-scale. Oxford University Press 2021-09-07 /pmc/articles/PMC8424057/ /pubmed/34494101 http://dx.doi.org/10.1093/gigascience/giab057 Text en © The Author(s) 2021. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Technical Note Ahmad, Tanveer Al Ars, Zaid Hofstee, H Peter VC@Scale: Scalable and high-performance variant calling on cluster environments
title	VC@Scale: Scalable and high-performance variant calling on cluster environments
title_sort	vc@scale: scalable and high-performance variant calling on cluster environments
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8424057/ https://www.ncbi.nlm.nih.gov/pubmed/34494101 http://dx.doi.org/10.1093/gigascience/giab057
