Cargando…

Veridical data science

Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, r...

Descripción completa

Detalles Bibliográficos
Autores principales: Yu, Bin, Kumbier, Karl
Formato: Online Artículo Texto
Lenguaje:English
Publicado: National Academy of Sciences 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7049126/
https://www.ncbi.nlm.nih.gov/pubmed/32054788
http://dx.doi.org/10.1073/pnas.1901326117
_version_ 1783502381832470528
author Yu, Bin
Kumbier, Karl
author_facet Yu, Bin
Kumbier, Karl
author_sort Yu, Bin
collection PubMed
description Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
format Online
Article
Text
id pubmed-7049126
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-70491262020-03-06 Veridical data science Yu, Bin Kumbier, Karl Proc Natl Acad Sci U S A Physical Sciences Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo. National Academy of Sciences 2020-02-25 2020-02-13 /pmc/articles/PMC7049126/ /pubmed/32054788 http://dx.doi.org/10.1073/pnas.1901326117 Text en Copyright © 2020 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ https://creativecommons.org/licenses/by-nc-nd/4.0/This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/) .
spellingShingle Physical Sciences
Yu, Bin
Kumbier, Karl
Veridical data science
title Veridical data science
title_full Veridical data science
title_fullStr Veridical data science
title_full_unstemmed Veridical data science
title_short Veridical data science
title_sort veridical data science
topic Physical Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7049126/
https://www.ncbi.nlm.nih.gov/pubmed/32054788
http://dx.doi.org/10.1073/pnas.1901326117
work_keys_str_mv AT yubin veridicaldatascience
AT kumbierkarl veridicaldatascience