RepeatFS: a file system providing reproducibility through provenance and automation
MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabyte...
Main Authors: | Westbrook, Anthony, Varki, Elizabeth, Thomas, W Kelley |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8189677/ https://www.ncbi.nlm.nih.gov/pubmed/33230554 http://dx.doi.org/10.1093/bioinformatics/btaa950 |
_version_ | 1783705534935859200 |
---|---|
author | Westbrook, Anthony Varki, Elizabeth Thomas, W Kelley |
author_facet | Westbrook, Anthony Varki, Elizabeth Thomas, W Kelley |
author_sort | Westbrook, Anthony |
collection | PubMed |
description | MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation. RESULTS: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. AVAILABILITY AND IMPLEMENTATION: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8189677 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8189677 2021-06-10 RepeatFS: a file system providing reproducibility through provenance and automation Westbrook, Anthony Varki, Elizabeth Thomas, W Kelley Bioinformatics Original Papers MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation. RESULTS: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. AVAILABILITY AND IMPLEMENTATION: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-11-24 /pmc/articles/PMC8189677/ /pubmed/33230554 http://dx.doi.org/10.1093/bioinformatics/btaa950 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Westbrook, Anthony Varki, Elizabeth Thomas, W Kelley RepeatFS: a file system providing reproducibility through provenance and automation |
title | RepeatFS: a file system providing reproducibility through provenance and automation |
title_full | RepeatFS: a file system providing reproducibility through provenance and automation |
title_fullStr | RepeatFS: a file system providing reproducibility through provenance and automation |
title_full_unstemmed | RepeatFS: a file system providing reproducibility through provenance and automation |
title_short | RepeatFS: a file system providing reproducibility through provenance and automation |
title_sort | repeatfs: a file system providing reproducibility through provenance and automation |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8189677/ https://www.ncbi.nlm.nih.gov/pubmed/33230554 http://dx.doi.org/10.1093/bioinformatics/btaa950 |