DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly...

Bibliographic Details
Main Authors:	Finak, Greg; Mayer, Bryan; Fulp, William; Obrecht, Paul; Sato, Alicia; Chung, Eva; Holman, Drienna; Gottardo, Raphael
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2018
Subjects:	Software Tool Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6139382/ https://www.ncbi.nlm.nih.gov/pubmed/30234197 http://dx.doi.org/10.12688/gatesopenres.12832.2

_version_	1783355507224870912
author	Finak, Greg; Mayer, Bryan; Fulp, William; Obrecht, Paul; Sato, Alicia; Chung, Eva; Holman, Drienna; Gottardo, Raphael
author_sort	Finak, Greg
collection	PubMed
description	A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
format	Online Article Text
id	pubmed-6139382
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-6139382 2018-09-17 DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis Finak, Greg Mayer, Bryan Fulp, William Obrecht, Paul Sato, Alicia Chung, Eva Holman, Drienna Gottardo, Raphael Gates Open Res Software Tool Article A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years. F1000 Research Limited 2018-07-10 /pmc/articles/PMC6139382/ /pubmed/30234197 http://dx.doi.org/10.12688/gatesopenres.12832.2 Text en Copyright: © 2018 Finak G et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Software Tool Article Finak, Greg Mayer, Bryan Fulp, William Obrecht, Paul Sato, Alicia Chung, Eva Holman, Drienna Gottardo, Raphael DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis
title	DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis
topic	Software Tool Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6139382/ https://www.ncbi.nlm.nih.gov/pubmed/30234197 http://dx.doi.org/10.12688/gatesopenres.12832.2

Similar Items