Scalable Workflows and Reproducible Data Analysis for Genomics

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the analysis and integration of data generated by such high-throughput technologies have be...

Bibliographic Details
Main Authors:	Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Di Tommaso, Paolo; Belhachemi, Dominique; Möller, Steffen; Smant, Geert; de Ligt, Joep; Prins, Pjotr
Format:	Online Article Text
Language:	English
Published:	2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7613310/ https://www.ncbi.nlm.nih.gov/pubmed/31278683 http://dx.doi.org/10.1007/978-1-4939-9074-0_24

collection	PubMed
description	Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines, each of which can be run in parallel: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow. We show how to bundle a number of tools used in evolutionary biology by using Debian, GNU Guix, and Bioconda software distributions, along with the use of container systems, such as Docker, GNU Guix, and Singularity. Together these distributions represent the overall majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
format	Online Article Text
id	pubmed-7613310
institution	National Center for Biotechnology Information
language	English
publishDate	2019
record_format	MEDLINE/PubMed
spelling	pubmed-7613310 2022-09-07 Scalable Workflows and Reproducible Data Analysis for Genomics Methods Mol Biol Article 2019-01-01 /pmc/articles/PMC7613310/ /pubmed/31278683 http://dx.doi.org/10.1007/978-1-4939-9074-0_24 Text en https://creativecommons.org/licenses/by/4.0/ This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
title	Scalable Workflows and Reproducible Data Analysis for Genomics
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7613310/ https://www.ncbi.nlm.nih.gov/pubmed/31278683 http://dx.doi.org/10.1007/978-1-4939-9074-0_24
