
Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments

In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground trut...


Bibliographic Details
Main Authors:	Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9515575/ https://www.ncbi.nlm.nih.gov/pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813

_version_	1784798514711101440
author	Shi, Jingpu Wang, Dong Tesei, Gino Norgeot, Beau
author_sort	Shi, Jingpu
collection	PubMed
description	In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility either by being modeled after small non-representative patient populations, being dissimilar to real target populations, or only providing known effects for two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling a nationwide cohort of more than 580,000 hypertension patient data including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well.
format	Online Article Text
id	pubmed-9515575
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9515575 2022-09-29 Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments Shi, Jingpu Wang, Dong Tesei, Gino Norgeot, Beau Front Artif Intell Artificial Intelligence In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility either by being modeled after small non-representative patient populations, being dissimilar to real target populations, or only providing known effects for two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling a nationwide cohort of more than 580,000 hypertension patient data including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well. Frontiers Media S.A. 2022-09-14 /pmc/articles/PMC9515575/ /pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813 Text en Copyright © 2022 Shi, Wang, Tesei and Norgeot. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9515575/ https://www.ncbi.nlm.nih.gov/pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813
