Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C)

OBJECTIVE: To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 c...

Bibliographic Details
Main Authors: Thomas, Jason A., Foraker, Randi E., Zamstein, Noa, Payne, Philip R.O., Wilcox, Adam B.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8282114/
https://www.ncbi.nlm.nih.gov/pubmed/34268525
http://dx.doi.org/10.1101/2021.07.06.21259051
_version_ 1783722949888442368
author Thomas, Jason A.
Foraker, Randi E.
Zamstein, Noa
Payne, Philip R.O.
Wilcox, Adam B.
author_sort Thomas, Jason A.
collection PubMed
description OBJECTIVE: To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two data sets was evaluated statistically and qualitatively. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n=171), and monthly indicator counts were similar for all unsuppressed zip codes (n=5,819). In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between the original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
format Online
Article
Text
id pubmed-8282114
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-8282114 2021-07-16 medRxiv Article Cold Spring Harbor Laboratory 2021-07-08 /pmc/articles/PMC8282114/ /pubmed/34268525 http://dx.doi.org/10.1101/2021.07.06.21259051 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C)
topic Article