
Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing...


Bibliographic Details
Main authors: Butcher, Bradley; Huang, Vincent S.; Robinson, Christopher; Reffin, Jeremy; Sgaier, Sema K.; Charles, Grace; Quadrianto, Novi
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Artificial Intelligence
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320747/
https://www.ncbi.nlm.nih.gov/pubmed/34337389
http://dx.doi.org/10.3389/frai.2021.612551
author Butcher, Bradley
Huang, Vincent S.
Robinson, Christopher
Reffin, Jeremy
Sgaier, Sema K.
Charles, Grace
Quadrianto, Novi
author_sort Butcher, Bradley
collection PubMed
description Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. 
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
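The workflow summarized in the description (generate a synthetic Bayesian network, sample a synthetic dataset from it, then score a learned structure against the known truth and compare the score to a user-defined threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: all function names are hypothetical, variables are assumed binary, and the skeleton F1 score is used here only as a stand-in performance metric, since the record does not say which metrics the Causal Datasheet reports. The structure-learning step itself is left out.

```python
import random
from itertools import product


def random_dag(n_nodes, edge_prob, rng):
    """Return a random DAG as {child: [parents]} (hypothetical helper)."""
    # Edges only run from a lower index to a higher one, so no cycle can form.
    return {
        child: [p for p in range(child) if rng.random() < edge_prob]
        for child in range(n_nodes)
    }


def random_cpts(dag, rng):
    """Draw P(node = 1 | parents) for each parent configuration (binary nodes assumed)."""
    return {
        node: {cfg: rng.random() for cfg in product((0, 1), repeat=len(parents))}
        for node, parents in dag.items()
    }


def sample_dataset(dag, cpts, n_rows, rng):
    """Ancestral sampling: sample each node after its parents, in index order."""
    rows = []
    for _ in range(n_rows):
        row = {}
        for node in sorted(dag):
            cfg = tuple(row[p] for p in dag[node])
            row[node] = 1 if rng.random() < cpts[node][cfg] else 0
        rows.append([row[i] for i in sorted(dag)])
    return rows


def skeleton_f1(true_dag, learned_dag):
    """F1 score on undirected edges between the true graph and a learned graph."""
    def skeleton(dag):
        return {frozenset((p, c)) for c, parents in dag.items() for p in parents}

    true_e, learned_e = skeleton(true_dag), skeleton(learned_dag)
    if not true_e and not learned_e:
        return 1.0
    tp = len(true_e & learned_e)
    if tp == 0:
        return 0.0
    precision = tp / len(learned_e)
    recall = tp / len(true_e)
    return 2 * precision * recall / (precision + recall)


# Example run: a 6-node synthetic network and 1000 synthetic rows.
rng = random.Random(0)
dag = random_dag(6, 0.3, rng)
cpts = random_cpts(dag, rng)
data = sample_dataset(dag, cpts, 1000, rng)
```

A structure-learning algorithm of the reader's choice would then be run on `data`, and a check such as `skeleton_f1(dag, learned) >= threshold` would decide whether the expected performance meets the user-defined threshold. Repeating this over many synthetic networks and several sample sizes would give the kind of performance expectation the description refers to.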
format Online
Article
Text
id pubmed-8320747
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8320747 2021-07-30 Front Artif Intell Artificial Intelligence Frontiers Media S.A. 2021-04-14 /pmc/articles/PMC8320747/ /pubmed/34337389 http://dx.doi.org/10.3389/frai.2021.612551 Text en Copyright © 2021 Butcher, Huang, Robinson, Reffin, Sgaier, Charles and Quadrianto. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks
topic Artificial Intelligence