
Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments


Bibliographic Details
Main authors: Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Artificial Intelligence
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9515575/
https://www.ncbi.nlm.nih.gov/pubmed/36187323
http://dx.doi.org/10.3389/frai.2022.918813
author Shi, Jingpu
Wang, Dong
Tesei, Gino
Norgeot, Beau
collection PubMed
description In the past decade, there has been exponentially growing interest in using observational data collected as part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair can be observed for each person. There have been multiple efforts to fill this void using synthetic data, where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility because they are modeled after small, non-representative patient populations, are dissimilar to real target populations, or provide known effects for only two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling data from a nationwide cohort of more than 580,000 hypertension patients, including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. A Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well.
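The abstract's central point, that synthetic data exposes a ground truth that real observational data hides, can be illustrated with a toy simulation. The sketch below is not the authors' ADS-GAN pipeline: the three treatments, their effect sizes, the single "severity" confounder, and all function names are hypothetical, chosen only to show how bias in an estimator becomes measurable once every patient's outcome under every treatment is known.

```python
import random
from statistics import mean

# Hypothetical treatments and true effects on systolic blood pressure (mmHg).
TREATMENTS = ["A", "B", "C"]
TRUE_EFFECT = {"A": -5.0, "B": -8.0, "C": -12.0}


def simulate(n=20000, seed=0):
    """Generate toy patients with ALL potential outcomes (the ground truth)."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        severity = rng.gauss(0.0, 1.0)       # confounder: disease severity
        baseline = 150.0 + 10.0 * severity   # untreated blood pressure
        # One outcome per treatment; real data would reveal only one of these.
        potential = {
            t: baseline + TRUE_EFFECT[t] + rng.gauss(0.0, 2.0)
            for t in TREATMENTS
        }
        # Confounded assignment: sicker patients tend to receive treatment C.
        weights = [0.2, 0.3, 0.5] if severity > 0.5 else [0.5, 0.3, 0.2]
        assigned = rng.choices(TREATMENTS, weights=weights)[0]
        patients.append({"assigned": assigned, "potential": potential})
    return patients


def true_ate(patients, t1, t0):
    """Ground-truth average effect of t1 vs t0, using both potential outcomes."""
    return mean(p["potential"][t1] - p["potential"][t0] for p in patients)


def naive_ate(patients, t1, t0):
    """Difference in observed group means, ignoring the confounder."""
    def observed_mean(t):
        return mean(p["potential"][t] for p in patients if p["assigned"] == t)
    return observed_mean(t1) - observed_mean(t0)


patients = simulate()
truth = true_ate(patients, "C", "A")
naive = naive_ate(patients, "C", "A")
print(f"true ATE (C vs A):  {truth:.2f}")
print(f"naive ATE (C vs A): {naive:.2f}")
print(f"bias of naive estimator: {naive - truth:.2f}")
```

Any causal estimator can be scored the same way by swapping it in for `naive_ate` and comparing its output to `true_ate`, for each pair of treatments.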
format Online
Article
Text
id pubmed-9515575
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9515575 2022-09-29 Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments. Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau. Front Artif Intell. Artificial Intelligence. Frontiers Media S.A. 2022-09-14 /pmc/articles/PMC9515575/ /pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813 Text en Copyright © 2022 Shi, Wang, Tesei and Norgeot. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments
topic Artificial Intelligence