Reproducible big data science: A case study in continuous FAIRness

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code thr...

Bibliographic Details
Main Authors: Madduri, Ravi; Chard, Kyle; D’Arcy, Mike; Jung, Segun C.; Rodriguez, Alexis; Sulakhe, Dinanath; Deutsch, Eric; Funk, Cory; Heavner, Ben; Richards, Matthew; Shannon, Paul; Glusman, Gustavo; Price, Nathan; Kesselman, Carl; Foster, Ian
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459504/
https://www.ncbi.nlm.nih.gov/pubmed/30973881
http://dx.doi.org/10.1371/journal.pone.0213013
_version_ 1783410190626848768
author Madduri, Ravi
Chard, Kyle
D’Arcy, Mike
Jung, Segun C.
Rodriguez, Alexis
Sulakhe, Dinanath
Deutsch, Eric
Funk, Cory
Heavner, Ben
Richards, Matthew
Shannon, Paul
Glusman, Gustavo
Price, Nathan
Kesselman, Carl
Foster, Ian
author_sort Madduri, Ravi
collection PubMed
description Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility—thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
format Online
Article
Text
id pubmed-6459504
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6459504 2019-05-03 Reproducible big data science: A case study in continuous FAIRness. PLoS One. Research Article. Public Library of Science, 2019-04-11. /pmc/articles/PMC6459504/ /pubmed/30973881 http://dx.doi.org/10.1371/journal.pone.0213013 Text, en. © 2019 Madduri et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Reproducible big data science: A case study in continuous FAIRness
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459504/
https://www.ncbi.nlm.nih.gov/pubmed/30973881
http://dx.doi.org/10.1371/journal.pone.0213013