repo: an R package for data-centered management of bioinformatic pipelines

BACKGROUND: Reproducibility in Data Analysis research has long been a significant concern, particularly in the areas of Bioinformatics and Computational Biology. Towards the aim of developing reproducible and reusable processes, Data Analysis management tools can help give structure and coherence...

Bibliographic Details
Main Author: Napolitano, Francesco
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5314482/
https://www.ncbi.nlm.nih.gov/pubmed/28209127
http://dx.doi.org/10.1186/s12859-017-1510-6
author Napolitano, Francesco
collection PubMed
description BACKGROUND: Reproducibility in Data Analysis research has long been a significant concern, particularly in the areas of Bioinformatics and Computational Biology. Towards the aim of developing reproducible and reusable processes, Data Analysis management tools can help give structure and coherence to complex data flows. Nonetheless, improved software quality comes at the cost of additional design and planning effort, which may become impractical in rapidly changing development environments. I propose that an adjustment of focus from processes to data in the management of Bioinformatic pipelines may help improve reproducibility with minimal impact on preexisting development practices. RESULTS: In this paper I introduce the repo R package for bioinformatic analysis management. The tool supports a data-centered philosophy that aims to improve analysis reproducibility and reusability with minimal design overhead. The core of repo lies in its support for easy data storage, retrieval, distribution and annotation. In repo, the data analysis flow is derived a posteriori from dependency annotations. CONCLUSIONS: The repo package constitutes an unobtrusive data and flow management extension of the R statistical language. Its adoption, together with good development practices, can help improve data analysis management, sharing and reproducibility, especially in the fields of Bioinformatics and Computational Biology.
format Online
Article
Text
id pubmed-5314482
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5314482 2017-02-24 repo: an R package for data-centered management of bioinformatic pipelines Napolitano, Francesco BMC Bioinformatics Software
BioMed Central 2017-02-16 /pmc/articles/PMC5314482/ /pubmed/28209127 http://dx.doi.org/10.1186/s12859-017-1510-6 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title repo: an R package for data-centered management of bioinformatic pipelines
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5314482/
https://www.ncbi.nlm.nih.gov/pubmed/28209127
http://dx.doi.org/10.1186/s12859-017-1510-6