
Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

BACKGROUND: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently re...


Bibliographic Details
Main authors: Sun, Xiaobo, Gao, Jingjing, Jin, Peng, Eng, Celeste, Burchard, Esteban G, Beaty, Terri H, Ruczinski, Ingo, Mathias, Rasika A, Barnes, Kathleen, Wang, Fusheng, Qin, Zhaohui S
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007233/
https://www.ncbi.nlm.nih.gov/pubmed/29762754
http://dx.doi.org/10.1093/gigascience/giy052
author Sun, Xiaobo
Gao, Jingjing
Jin, Peng
Eng, Celeste
Burchard, Esteban G
Beaty, Terri H
Ruczinski, Ingo
Mathias, Rasika A
Barnes, Kathleen
Wang, Fusheng
Qin, Zhaohui S
author_sort Sun, Xiaobo
collection PubMed
description BACKGROUND: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine-based methods become increasingly inefficient when processing large numbers of files because of excessive computation time and input/output (I/O) bottlenecks. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. FINDINGS: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarking them against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)–based high-performance computing (HPC) implementation, and the popular VCFTools. CONCLUSIONS: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or provide much better strong and weak scalability than traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
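The sorted-merging operation described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Hadoop, HBase, or Spark schema; it only shows the underlying idea of a multiway merge by genomic location, plus a divide-and-conquer variant that splits the job by chromosome. The record layout `(chrom, pos, genotype)`, the `CHROM_ORDER` table, the `"./."` placeholder for a missing call, and the function names are all assumptions made for this example.

```python
import heapq
from itertools import groupby

# Hypothetical chromosome ranking, used only for this illustration.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def location_key(record):
    """Sort key for a (chrom, pos, genotype) record: (chromosome rank, position)."""
    chrom, pos, _ = record
    return (CHROM_ORDER[chrom], pos)


def sorted_merge(sample_records):
    """Multiway-merge per-sample record lists that are each sorted by location.

    sample_records maps a sample name to a list of (chrom, pos, genotype)
    tuples, assumed sorted and with at most one record per location.
    Returns one (chrom, pos, genotypes) row per location, with one genotype
    per sample in sorted sample order and "./." where a sample has no call.
    """
    samples = sorted(sample_records)
    # Tag every record with its sort key and sample name before merging.
    streams = [
        [(location_key(r), s, r) for r in sample_records[s]] for s in samples
    ]
    rows = []
    merged = heapq.merge(*streams)
    for _, group in groupby(merged, key=lambda item: item[0]):
        calls = {}
        for _, sample, (chrom, pos, genotype) in group:
            calls[sample] = genotype
        rows.append((chrom, pos, [calls.get(s, "./.") for s in samples]))
    return rows


def partitioned_merge(sample_records):
    """Divide and conquer: merge each chromosome separately, then concatenate.

    The per-chromosome subtasks are independent, so a distributed system
    could run them in parallel; here they run one after another.
    """
    chroms = sorted(
        {r[0] for recs in sample_records.values() for r in recs},
        key=CHROM_ORDER.get,
    )
    rows = []
    for chrom in chroms:
        part = {
            s: [r for r in recs if r[0] == chrom]
            for s, recs in sample_records.items()
        }
        rows.extend(sorted_merge(part))
    return rows
```

Because each chromosome partition covers a disjoint, ordered range of locations, concatenating the partition results in chromosome order gives the same rows as one global merge, which is what lets the subtasks run without coordinating with each other.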
format Online
Article
Text
id pubmed-6007233
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6007233 2018-06-25 Gigascience Technical Note Oxford University Press 2018-05-10 /pmc/articles/PMC6007233/ /pubmed/29762754 http://dx.doi.org/10.1093/gigascience/giy052 Text en © The Authors 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007233/
https://www.ncbi.nlm.nih.gov/pubmed/29762754
http://dx.doi.org/10.1093/gigascience/giy052