Evaluating the utility of synthetic COVID-19 case data

BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results betwee...

Bibliographic Details
Main Authors: El Emam, Khaled, Mosquera, Lucy, Jonker, Elizabeth, Sood, Harpreet
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7936723/
https://www.ncbi.nlm.nih.gov/pubmed/33709065
http://dx.doi.org/10.1093/jamiaopen/ooab012
author El Emam, Khaled
Mosquera, Lucy
Jonker, Elizabeth
Sood, Harpreet
collection PubMed
description BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. METHODS: A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data. RESULTS: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low. CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
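The abstract reports a "confidence interval overlap" between the real and synthetic models but does not say how it was computed. The sketch below assumes one commonly used definition: the length of the intersection of the two intervals, expressed as a fraction of each interval's width, then averaged. The function name `ci_overlap` and the choice to clip non-overlapping intervals to zero are illustrative assumptions, not the authors' stated method.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval that is covered by their intersection.

    Assumed definition (not stated in the abstract):
        0.5 * (intersection / width_real + intersection / width_synth)
    Disjoint intervals are clipped to 0.0, which is a simplification.
    """
    (lo_r, hi_r), (lo_s, hi_s) = real_ci, synth_ci
    if hi_r <= lo_r or hi_s <= lo_s:
        raise ValueError("each interval must have positive width")

    # Length of the region shared by both intervals.
    intersection = min(hi_r, hi_s) - max(lo_r, lo_s)
    if intersection <= 0:
        return 0.0

    return 0.5 * (intersection / (hi_r - lo_r) + intersection / (hi_s - lo_s))


# Rounded 95% CI bounds as printed in the abstract (real, synthetic).
auroc_overlap = ci_overlap((0.941, 0.948), (0.936, 0.944))
auprc_overlap = ci_overlap((0.313, 0.368), (0.286, 0.342))

print(f"AUROC CI overlap: {auroc_overlap:.1%}")
print(f"AUPRC CI overlap: {auprc_overlap:.1%}")
```

With the rounded bounds printed in the abstract, this definition gives about 40% for AUROC and about 52% for AUPRC. The reported figures are 45.05% and 52.02%, so the published values were presumably computed from unrounded bounds, or with a somewhat different definition.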
format Online
Article
Text
id pubmed-7936723
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7936723 2021-03-10 Evaluating the utility of synthetic COVID-19 case data. El Emam, Khaled; Mosquera, Lucy; Jonker, Elizabeth; Sood, Harpreet. JAMIA Open. Research and Applications. Oxford University Press 2021-03-01 /pmc/articles/PMC7936723/ /pubmed/33709065 http://dx.doi.org/10.1093/jamiaopen/ooab012 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Evaluating the utility of synthetic COVID-19 case data
topic Research and Applications