
DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations

There is a significant public demand for rapid data-driven scientific investigations using aggregated sensitive information. However, many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creati...


Bibliographic Details
Main Authors: Zhou, Nina, Wang, Lu, Marino, Simeone, Zhao, Yi, Dinov, Ivo D
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9585991/
https://www.ncbi.nlm.nih.gov/pubmed/36274750
http://dx.doi.org/10.1177/17483026211065379
author Zhou, Nina
Wang, Lu
Marino, Simeone
Zhao, Yi
Dinov, Ivo D
collection PubMed
description There is a significant public demand for rapid data-driven scientific investigations using aggregated sensitive information. However, many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creating anonymized data archives whose joint distributions closely resemble those of the original (sensitive) data. Specifically, we introduce the DataSifter technique for time-varying correlated data (DataSifter II), which relies on iterative model-based imputation using a generalized linear mixed model and a random effects-expectation maximization tree. DataSifter II can be used to generate synthetic repeated measures data for testing and validating new analytical techniques. Applied to simulated and real clinical data, DataSifter II provides an extensive reduction of re-identification risk (data privacy) compared to the multiple imputation method, while preserving the analytical value (data utility) of the obfuscated data. In a simulation involving 20% artificial missingness in the data, DataSifter II shows at least an 80% reduction of the disclosure risk compared to the multiple imputation method, without a substantial impact on the analytical value of the data. In a separate validation on clinical data (Medical Information Mart for Intensive Care III), a model-based statistical inference drawn from the original data agrees with an analogous inference obtained using the DataSifter II obfuscated (sifted) data. For large time-varying datasets containing sensitive information, the proposed technique provides an automated tool for alleviating the barriers to data sharing and facilitating effective, advanced, and collaborative analytics.
format Online
Article
Text
id pubmed-9585991
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-9585991 2023-01-20 DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations Zhou, Nina Wang, Lu Marino, Simeone Zhao, Yi Dinov, Ivo D J Algorithm Comput Technol Article 2022 2022-01-20 /pmc/articles/PMC9585991/ /pubmed/36274750 http://dx.doi.org/10.1177/17483026211065379 Text en https://creativecommons.org/licenses/by-nc/4.0/ Creative Commons Non Commercial CC BY-NC: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
title DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations
topic Article