Halvade somatic: Somatic variant calling with Apache Spark

BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this lea...

Bibliographic Details
Main authors:	Decap, Dries; de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Costanza, Pascal; Herzeel, Charlotte; Wuyts, Roel; Marchal, Kathleen; Fostier, Jan
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Technical Note
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8756192/ https://www.ncbi.nlm.nih.gov/pubmed/35022699 http://dx.doi.org/10.1093/gigascience/giab094

author	Decap, Dries; de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Costanza, Pascal; Herzeel, Charlotte; Wuyts, Roel; Marchal, Kathleen; Fostier, Jan
author_sort	Decap, Dries
collection	PubMed
description	BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. FINDINGS: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. 
CONCLUSIONS: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
format	Online Article Text
id	pubmed-8756192
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8756192 2022-01-13 Halvade somatic: Somatic variant calling with Apache Spark Gigascience Technical Note Oxford University Press 2022-01-12 /pmc/articles/PMC8756192/ /pubmed/35022699 http://dx.doi.org/10.1093/gigascience/giab094 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Halvade somatic: Somatic variant calling with Apache Spark
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8756192/ https://www.ncbi.nlm.nih.gov/pubmed/35022699 http://dx.doi.org/10.1093/gigascience/giab094
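
The speedup figures quoted in the description can be rechecked from the reported runtimes. The following is a minimal sketch, not part of the record or of Halvade Somatic itself: the function names are hypothetical, and parallel efficiency (speedup divided by node count) is an extra quantity computed here for illustration that the abstract does not report.

```python
# Recompute the speedups quoted in the abstract from its reported runtimes.
# Function names are hypothetical; "parallel efficiency" is an illustrative
# addition that the abstract itself does not report.

ORIGINAL_PIPELINE_H = 84.5   # original pipeline, single 36-core node
HALVADE_1_NODE_H = 19.5      # Halvade Somatic, single 36-core node
HALVADE_16_NODES_H = 1.36    # Halvade Somatic, 16 nodes
NODES = 16


def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if baseline_hours <= 0 or improved_hours <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_hours / improved_hours


def parallel_efficiency(single_node_hours: float,
                        multi_node_hours: float,
                        nodes: int) -> float:
    """Return the multi-node speedup as a fraction of ideal linear scaling."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(single_node_hours, multi_node_hours) / nodes


if __name__ == "__main__":
    single = speedup(ORIGINAL_PIPELINE_H, HALVADE_1_NODE_H)
    multi = speedup(HALVADE_1_NODE_H, HALVADE_16_NODES_H)
    overall = speedup(ORIGINAL_PIPELINE_H, HALVADE_16_NODES_H)
    efficiency = parallel_efficiency(HALVADE_1_NODE_H, HALVADE_16_NODES_H, NODES)

    print(f"single-node speedup over original pipeline: {single:.1f}x")
    print(f"16-node speedup over single node:           {multi:.1f}x")
    print(f"overall speedup over original pipeline:     {overall:.1f}x")
    print(f"parallel efficiency on {NODES} nodes:            {efficiency:.2f}")
```

With the rounded runtimes given in the abstract, the single-node speedup comes out at about 4.3 times, matching the quoted value. The multi-node speedup comes out at about 14.3 times, slightly below the quoted 14.4; the difference presumably reflects rounding of the published runtimes.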
