
A method for generating synthetic longitudinal health data

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to rea...


Bibliographic Details
Main Authors: Mosquera, Lucy, El Emam, Khaled, Ding, Lei, Sharma, Vishal, Zhang, Xue Hua, Kababji, Samer El, Carvalho, Chris, Hamilton, Brian, Palfrey, Dan, Kong, Linglong, Jiang, Bei, Eurich, Dean T.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10034254/
https://www.ncbi.nlm.nih.gov/pubmed/36959532
http://dx.doi.org/10.1186/s12874-023-01869-w
author Mosquera, Lucy
El Emam, Khaled
Ding, Lei
Sharma, Vishal
Zhang, Xue Hua
Kababji, Samer El
Carvalho, Chris
Hamilton, Brian
Palfrey, Dan
Kong, Linglong
Jiang, Bei
Eurich, Dean T.
author_sort Mosquera, Lucy
collection PubMed
description Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data come from 120,000 individuals in Alberta Health’s administrative health database. We assess how similar our synthetic data are to the real data using generic utility assessments of the structure and general patterns in the data, and by recreating, on the synthetic data, a specific analysis of the real data that is commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. The generic utility assessments used Hellinger distance to quantify the difference in distributions between the real and synthetic datasets: event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1: mean absolute difference 0.0896, sd 0.159; order 2: mean Hellinger distance 0.2195, sd 0.2724), and joint distributions (0.352). Random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064. Together, these results indicate small differences between the distributions in the real data and the synthetic data. When a realistic analysis was applied to both real and synthetic datasets, Cox regression achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produce analytic results similar to those from real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data are suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-023-01869-w.
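The abstract above reports most utility results as Hellinger distances between real and synthetic distributions. As a rough illustration of that metric only (not the authors' code), the sketch below computes the Hellinger distance between two discrete distributions estimated from category counts. The helper names `empirical` and `hellinger` and the toy event lists are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def empirical(values):
    """Estimate a discrete distribution (category -> probability) from observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts.

    Categories missing from one distribution are treated as probability 0.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with no category in common.
    """
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2.0)


# Toy event-type sequences (made up for illustration, not study data).
real_events = ["visit", "visit", "lab", "rx", "visit", "lab"]
synth_events = ["visit", "lab", "lab", "rx", "visit", "rx"]

distance = hellinger(empirical(real_events), empirical(synth_events))
print(f"Hellinger distance: {distance:.4f}")
```

Values near 0, such as the 0.027 reported for event types, mean the synthetic distribution is close to the real one; values near 1 mean the two barely overlap.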
format Online
Article
Text
id pubmed-10034254
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10034254 2023-03-23 A method for generating synthetic longitudinal health data. Mosquera, Lucy; El Emam, Khaled; Ding, Lei; Sharma, Vishal; Zhang, Xue Hua; Kababji, Samer El; Carvalho, Chris; Hamilton, Brian; Palfrey, Dan; Kong, Linglong; Jiang, Bei; Eurich, Dean T. BMC Med Res Methodol. Research. BioMed Central 2023-03-23 /pmc/articles/PMC10034254/ /pubmed/36959532 http://dx.doi.org/10.1186/s12874-023-01869-w Text en
© The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A method for generating synthetic longitudinal health data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10034254/
https://www.ncbi.nlm.nih.gov/pubmed/36959532
http://dx.doi.org/10.1186/s12874-023-01869-w