Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments
In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground trut...
Main authors: | Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Artificial Intelligence |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9515575/ https://www.ncbi.nlm.nih.gov/pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813 |
_version_ | 1784798514711101440 |
---|---|
author | Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau |
author_facet | Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau |
author_sort | Shi, Jingpu |
collection | PubMed |
description | In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility either by being modeled after small non-representative patient populations, being dissimilar to real target populations, or only providing known effects for two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling a nationwide cohort of more than 580,000 hypertension patients, including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. A Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well. |
format | Online Article Text |
id | pubmed-9515575 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9515575 2022-09-29 Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments Shi, Jingpu; Wang, Dong; Tesei, Gino; Norgeot, Beau Front Artif Intell Artificial Intelligence Frontiers Media S.A. 2022-09-14 /pmc/articles/PMC9515575/ /pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813 Text en Copyright © 2022 Shi, Wang, Tesei and Norgeot. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Artificial Intelligence Shi, Jingpu Wang, Dong Tesei, Gino Norgeot, Beau Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title | Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title_full | Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title_fullStr | Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title_full_unstemmed | Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title_short | Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
title_sort | generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9515575/ https://www.ncbi.nlm.nih.gov/pubmed/36187323 http://dx.doi.org/10.3389/frai.2022.918813 |
work_keys_str_mv | AT shijingpu generatinghighfidelityprivacyconscioussyntheticpatientdataforcausaleffectestimationwithmultipletreatments AT wangdong generatinghighfidelityprivacyconscioussyntheticpatientdataforcausaleffectestimationwithmultipletreatments AT teseigino generatinghighfidelityprivacyconscioussyntheticpatientdataforcausaleffectestimationwithmultipletreatments AT norgeotbeau generatinghighfidelityprivacyconscioussyntheticpatientdataforcausaleffectestimationwithmultipletreatments |