
Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework


Bibliographic Details
Main Authors: Ahmad, Tanveer, Ahmed, Nauman, Al-Ars, Zaid, Hofstee, H. Peter
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7677819/
https://www.ncbi.nlm.nih.gov/pubmed/33208101
http://dx.doi.org/10.1186/s12864-020-07013-y
_version_ 1783612055916380160
author Ahmad, Tanveer
Ahmed, Nauman
Al-Ars, Zaid
Hofstee, H. Peter
author_facet Ahmad, Tanveer
Ahmed, Nauman
Al-Ars, Zaid
Hofstee, H. Peter
author_sort Ahmad, Tanveer
collection PubMed
description BACKGROUND: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data, which needs to be processed efficiently for downstream analyses. Computing systems need these large amounts of data close to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access; processing this data from disk incurs huge I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. Recent developments in storage-class memory and non-volatile memory technologies, however, have enabled computing systems to hold huge data sets in memory and process them directly, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, data must be placed in memory in a properly formatted way and accessed at high throughput, while avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without accessing disk storage and without (de)serialization and copy overheads. IMPLEMENTATION: We integrate the Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard and GATK, to allow in-memory communication between these applications. This also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.
RESULTS: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache locality, and parallel scalability through shared memory objects. Our implementation focuses on the GATK Best Practices workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placement and sharing techniques, such as ramDisk and Unix pipes, and show that the columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of variant calling workflows. Similarly, speedups of 1.45x and 1.27x for these data sets, respectively, are achieved compared to the second fastest workflow. In some individual tools, particularly sorting, duplicate removal and base quality score recalibration, the speedup is even more promising. AVAILABILITY: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.
format Online
Article
Text
id pubmed-7677819
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-76778192020-11-20 Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework Ahmad, Tanveer Ahmed, Nauman Al-Ars, Zaid Hofstee, H. Peter BMC Genomics Software BioMed Central 2020-11-18 /pmc/articles/PMC7677819/ /pubmed/33208101 http://dx.doi.org/10.1186/s12864-020-07013-y Text en © The Author(s) 2020. Open Access: the article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Software
Ahmad, Tanveer
Ahmed, Nauman
Al-Ars, Zaid
Hofstee, H. Peter
Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title_full Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title_fullStr Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title_full_unstemmed Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title_short Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
title_sort optimizing performance of gatk workflows using apache arrow in-memory data framework
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7677819/
https://www.ncbi.nlm.nih.gov/pubmed/33208101
http://dx.doi.org/10.1186/s12864-020-07013-y
work_keys_str_mv AT ahmadtanveer optimizingperformanceofgatkworkflowsusingapachearrowinmemorydataframework
AT ahmednauman optimizingperformanceofgatkworkflowsusingapachearrowinmemorydataframework
AT alarszaid optimizingperformanceofgatkworkflowsusingapachearrowinmemorydataframework
AT hofsteehpeter optimizingperformanceofgatkworkflowsusingapachearrowinmemorydataframework