Sustainable data analysis with Snakemake

Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way...

Bibliographic Details
Main authors: Mölder, Felix, Jablonski, Kim Philipp, Letcher, Brice, Hall, Michael B., Tomkins-Tinch, Christopher H., Sochat, Vanessa, Forster, Jan, Lee, Soohyun, Twardziok, Sven O., Kanitz, Alexander, Wilm, Andreas, Holtgrewe, Manuel, Rahmann, Sven, Nahnsen, Sven, Köster, Johannes
Format: Online Article Text
Language: English
Published: F1000 Research Limited 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8114187/
https://www.ncbi.nlm.nih.gov/pubmed/34035898
http://dx.doi.org/10.12688/f1000research.29032.2
_version_ 1783691010937716736
author Mölder, Felix
Jablonski, Kim Philipp
Letcher, Brice
Hall, Michael B.
Tomkins-Tinch, Christopher H.
Sochat, Vanessa
Forster, Jan
Lee, Soohyun
Twardziok, Sven O.
Kanitz, Alexander
Wilm, Andreas
Holtgrewe, Manuel
Rahmann, Sven
Nahnsen, Sven
Köster, Johannes
author_sort Mölder, Felix
collection PubMed
description Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
format Online
Article
Text
id pubmed-8114187
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher F1000 Research Limited
record_format MEDLINE/PubMed
spelling pubmed-81141872021-05-24 Sustainable data analysis with Snakemake Mölder, Felix Jablonski, Kim Philipp Letcher, Brice Hall, Michael B. Tomkins-Tinch, Christopher H. Sochat, Vanessa Forster, Jan Lee, Soohyun Twardziok, Sven O. Kanitz, Alexander Wilm, Andreas Holtgrewe, Manuel Rahmann, Sven Nahnsen, Sven Köster, Johannes F1000Res Method Article Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results. F1000 Research Limited 2021-04-19 /pmc/articles/PMC8114187/ /pubmed/34035898 http://dx.doi.org/10.12688/f1000research.29032.2 Text en Copyright: © 2021 Mölder F et al. 
https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Sustainable data analysis with Snakemake
topic Method Article