
RepeatFS: a file system providing reproducibility through provenance and automation

MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabyte...


Bibliographic Details
Main authors: Westbrook, Anthony; Varki, Elizabeth; Thomas, W Kelley
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8189677/
https://www.ncbi.nlm.nih.gov/pubmed/33230554
http://dx.doi.org/10.1093/bioinformatics/btaa950
_version_ 1783705534935859200
author Westbrook, Anthony
Varki, Elizabeth
Thomas, W Kelley
author_facet Westbrook, Anthony
Varki, Elizabeth
Thomas, W Kelley
author_sort Westbrook, Anthony
collection PubMed
description MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation. RESULTS: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. AVAILABILITY AND IMPLEMENTATION: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8189677
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8189677 2021-06-10 RepeatFS: a file system providing reproducibility through provenance and automation Westbrook, Anthony Varki, Elizabeth Thomas, W Kelley Bioinformatics Original Papers MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation. RESULTS: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. AVAILABILITY AND IMPLEMENTATION: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-11-24 /pmc/articles/PMC8189677/ /pubmed/33230554 http://dx.doi.org/10.1093/bioinformatics/btaa950 Text en © The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Westbrook, Anthony
Varki, Elizabeth
Thomas, W Kelley
RepeatFS: a file system providing reproducibility through provenance and automation
title RepeatFS: a file system providing reproducibility through provenance and automation
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8189677/
https://www.ncbi.nlm.nih.gov/pubmed/33230554
http://dx.doi.org/10.1093/bioinformatics/btaa950