
The Problem of Fairness in Synthetic Healthcare Data

Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare dat...


Bibliographic Details
Main Authors: Bhanot, Karan, Qi, Miao, Erickson, John S., Guyon, Isabelle, Bennett, Kristin P.
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8468495/
https://www.ncbi.nlm.nih.gov/pubmed/34573790
http://dx.doi.org/10.3390/e23091165
author Bhanot, Karan
Qi, Miao
Erickson, John S.
Guyon, Isabelle
Bennett, Kristin P.
collection PubMed
description Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple inpatient and outpatient visits, making it a time-series dataset that is often influenced by protected attributes like age, gender, race, etc. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups such that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data and examine all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels and thus fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models to create more equitable synthetic healthcare datasets.
format Online
Article
Text
id pubmed-8468495
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8468495 2021-09-27 The Problem of Fairness in Synthetic Healthcare Data Bhanot, Karan Qi, Miao Erickson, John S. Guyon, Isabelle Bennett, Kristin P. Entropy (Basel) Article MDPI 2021-09-04 /pmc/articles/PMC8468495/ /pubmed/34573790 http://dx.doi.org/10.3390/e23091165 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title The Problem of Fairness in Synthetic Healthcare Data
topic Article