
Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data


Bibliographic Details
Main authors:	Platzer, Michael; Reutterer, Thomas
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Big Data
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276128/ https://www.ncbi.nlm.nih.gov/pubmed/34268491 http://dx.doi.org/10.3389/fdata.2021.679939

author	Platzer, Michael Reutterer, Thomas
collection	PubMed
description	AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured by statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating, for each individual synthetic record, the distance to the closest record in the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
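The two measurements named in the description can be illustrated with a minimal Python sketch. This is not the authors' published implementation: the function names are hypothetical, Euclidean distance on numeric rows is an assumed stand-in for whatever mixed-type distance the paper uses, and total variation distance on a single categorical column is an assumed example of a "statistical distance of a marginal distribution".

```python
import math


def nearest_distance(record, reference):
    """Euclidean distance from one record to its closest row in `reference`.

    Assumption: all attributes are numeric. A mixed-type dataset would need
    a different distance (or a prior encoding step).
    """
    return min(math.dist(record, row) for row in reference)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records that lie closer to training than to holdout.

    With equally sized training and holdout sets, a value near 0.5 suggests
    the synthetic records are no closer to training records than to unseen
    records; a value near 1.0 suggests they track individual training records.
    """
    score = 0.0
    for record in synthetic:
        d_train = nearest_distance(record, training)
        d_holdout = nearest_distance(record, holdout)
        if d_train < d_holdout:
            score += 1.0
        elif d_train == d_holdout:
            score += 0.5  # split ties evenly between the two sets
    return score / len(synthetic)


def total_variation_distance(real_column, synthetic_column):
    """Total variation distance between two categorical marginals (0 = identical)."""
    categories = set(real_column) | set(synthetic_column)
    n_real, n_syn = len(real_column), len(synthetic_column)
    return 0.5 * sum(
        abs(real_column.count(c) / n_real - synthetic_column.count(c) / n_syn)
        for c in categories
    )


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    training = [(0.0, 0.0), (10.0, 10.0)]
    holdout = [(5.0, 5.0), (20.0, 20.0)]
    synthetic = [(0.0, 0.0), (5.0, 5.0)]
    print(share_closer_to_training(synthetic, training, holdout))
    print(total_variation_distance(["a", "a", "b", "b"], ["a", "a", "a", "b"]))
```

In this sketch the privacy check is relative rather than absolute: it compares distances to the training set against distances to a holdout set drawn from the same population, so no fixed distance threshold has to be chosen.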
format	Online Article Text
id	pubmed-8276128
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8276128 2021-07-14 Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Platzer, Michael; Reutterer, Thomas. Front Big Data. Big Data. Frontiers Media S.A. 2021-06-29. /pmc/articles/PMC8276128/ /pubmed/34268491 http://dx.doi.org/10.3389/fdata.2021.679939 Text en. Copyright © 2021 Platzer and Reutterer. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data
topic	Big Data
