Sustainable data analysis with Snakemake
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way...
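Main authors: | Mölder, Felix; Jablonski, Kim Philipp; Letcher, Brice; Hall, Michael B.; Tomkins-Tinch, Christopher H.; Sochat, Vanessa; Forster, Jan; Lee, Soohyun; Twardziok, Sven O.; Kanitz, Alexander; Wilm, Andreas; Holtgrewe, Manuel; Rahmann, Sven; Nahnsen, Sven; Köster, Johannes |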
---|---|
Format: | Online Article Text |
Language: | English |
Published: | F1000 Research Limited, 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8114187/ https://www.ncbi.nlm.nih.gov/pubmed/34035898 http://dx.doi.org/10.12688/f1000research.29032.2 |
_version_ | 1783691010937716736 |
---|---|
author | Mölder, Felix; Jablonski, Kim Philipp; Letcher, Brice; Hall, Michael B.; Tomkins-Tinch, Christopher H.; Sochat, Vanessa; Forster, Jan; Lee, Soohyun; Twardziok, Sven O.; Kanitz, Alexander; Wilm, Andreas; Holtgrewe, Manuel; Rahmann, Sven; Nahnsen, Sven; Köster, Johannes |
author_sort | Mölder, Felix |
collection | PubMed |
description | Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results. |
format | Online Article Text |
id | pubmed-8114187 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | F1000 Research Limited |
record_format | MEDLINE/PubMed |
spelling | pubmed-8114187 2021-05-24 Sustainable data analysis with Snakemake Mölder, Felix Jablonski, Kim Philipp Letcher, Brice Hall, Michael B. Tomkins-Tinch, Christopher H. Sochat, Vanessa Forster, Jan Lee, Soohyun Twardziok, Sven O. Kanitz, Alexander Wilm, Andreas Holtgrewe, Manuel Rahmann, Sven Nahnsen, Sven Köster, Johannes F1000Res Method Article Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results. F1000 Research Limited 2021-04-19 /pmc/articles/PMC8114187/ /pubmed/34035898 http://dx.doi.org/10.12688/f1000research.29032.2 Text en Copyright: © 2021 Mölder F et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Sustainable data analysis with Snakemake |
topic | Method Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8114187/ https://www.ncbi.nlm.nih.gov/pubmed/34035898 http://dx.doi.org/10.12688/f1000research.29032.2 |