
Managing genomic variant calling workflows with Swift/T

Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of sample...


Bibliographic Details
Main authors: Ahmed, Azza E., Heldenbrand, Jacob, Asmann, Yan, Fadlelmola, Faisal M., Katz, Daniel S., Kendig, Katherine, Kendzior, Matthew C., Li, Tiffany, Ren, Yingxue, Rodriguez, Elliott, Weber, Matthew R., Wozniak, Justin M., Zermeno, Jennie, Mainzer, Liudmila S.
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615596/
https://www.ncbi.nlm.nih.gov/pubmed/31287816
http://dx.doi.org/10.1371/journal.pone.0211608
author Ahmed, Azza E.
Heldenbrand, Jacob
Asmann, Yan
Fadlelmola, Faisal M.
Katz, Daniel S.
Kendig, Katherine
Kendzior, Matthew C.
Li, Tiffany
Ren, Yingxue
Rodriguez, Elliott
Weber, Matthew R.
Wozniak, Justin M.
Zermeno, Jennie
Mainzer, Liudmila S.
collection PubMed
description Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the “best” workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), so a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures. The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
format Online
Article
Text
id pubmed-6615596
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6615596 2019-07-25 PLoS One Research Article Public Library of Science 2019-07-09 /pmc/articles/PMC6615596/ /pubmed/31287816 http://dx.doi.org/10.1371/journal.pone.0211608 Text en © 2019 Ahmed et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Managing genomic variant calling workflows with Swift/T
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615596/
https://www.ncbi.nlm.nih.gov/pubmed/31287816
http://dx.doi.org/10.1371/journal.pone.0211608