
DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

BACKGROUND: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the...


Bibliographic Details
Main Authors:	Linderman, Michael D.; Chia, Davin; Wallace, Forrest; Nothaft, Frank A.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6787990/ https://www.ncbi.nlm.nih.gov/pubmed/31604420 http://dx.doi.org/10.1186/s12859-019-3108-7

author	Linderman, Michael D.; Chia, Davin; Wallace, Forrest; Nothaft, Frank A.
collection	PubMed
description	BACKGROUND: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. RESULTS: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. CONCLUSIONS: We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
format	Online Article Text
id	pubmed-6787990
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6787990 2019-10-18 DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark Linderman, Michael D.; Chia, Davin; Wallace, Forrest; Nothaft, Frank A. BMC Bioinformatics Software BioMed Central 2019-10-11 /pmc/articles/PMC6787990/ /pubmed/31604420 http://dx.doi.org/10.1186/s12859-019-3108-7 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6787990/ https://www.ncbi.nlm.nih.gov/pubmed/31604420 http://dx.doi.org/10.1186/s12859-019-3108-7
