
Data Flush

Data perturbation is a technique for generating synthetic data by adding “noise” to raw data; it has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that the synthetic data it produces usually incur information...


Bibliographic Details
Main authors: Shen, Xiaotong; Bi, Xuan; Shen, Rex
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997048/
https://www.ncbi.nlm.nih.gov/pubmed/36909365
http://dx.doi.org/10.1162/99608f92.681fe3bd
author Shen, Xiaotong
Bi, Xuan
Shen, Rex
collection PubMed
description Data perturbation is a technique for generating synthetic data by adding “noise” to raw data; it has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that the synthetic data it produces usually incur information loss as the price of privacy protection. This information loss, in turn, reduces the accuracy of any statistical or machine learning method based on the synthetic data, weakening downstream analysis and degrading machine learning performance. In this article, we introduce and advocate a fundamental principle of data perturbation, which requires the preservation of the distribution of raw data. To achieve this, we propose a new scheme, named data flush, which ensures the validity of the downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating the requirement of strict privacy protection, for instance, differential privacy. We highlight multiple facets of data flush through examples.
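The contrast the description draws, between additive noise that distorts the raw distribution and a nonlinear perturbation that preserves it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: the raw data are standard normal with a known CDF, the noise is Laplace, the helper names are hypothetical, and the distribution-preserving step is a generic probability integral transform rather than the authors' data flush scheme. The noise scale is arbitrary and carries no calibrated privacy guarantee.

```python
import math
import random
import statistics


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def _integrated_laplace_cdf(t, scale):
    # Antiderivative of the Laplace(0, scale) CDF, continuous at t = 0.
    if t < 0:
        return 0.5 * scale * math.exp(t / scale)
    return t + 0.5 * scale * math.exp(-t / scale)


def uniform_plus_laplace_cdf(w, scale):
    # CDF of U + e, where U ~ Uniform(0, 1) and e ~ Laplace(0, scale).
    return _integrated_laplace_cdf(w, scale) - _integrated_laplace_cdf(w - 1.0, scale)


def additive_perturb(z, rng, scale):
    # Classic linear perturbation: raw value plus noise.
    return z + laplace(rng, scale)


def distribution_preserving_perturb(z, rng, scale, dist):
    # Hypothetical nonlinear perturbation (assumes the raw CDF is known):
    # map to the uniform scale, add noise, map the noisy value back to
    # uniform with its own CDF, then invert the raw CDF.
    w = dist.cdf(z) + laplace(rng, scale)
    p = uniform_plus_laplace_cdf(w, scale)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
    return dist.inv_cdf(p)


rng = random.Random(0)
dist = statistics.NormalDist(0.0, 1.0)  # assumed raw distribution
scale = 0.5                             # illustrative noise scale

raw = [rng.gauss(0.0, 1.0) for _ in range(20000)]
additive = [additive_perturb(z, rng, scale) for z in raw]
preserved = [distribution_preserving_perturb(z, rng, scale, dist) for z in raw]

var_raw = statistics.pvariance(raw)
var_additive = statistics.pvariance(additive)    # theory: 1 + 2 * scale**2
var_preserved = statistics.pvariance(preserved)  # theory: 1

print(f"raw variance:       {var_raw:.3f}")
print(f"additive variance:  {var_additive:.3f}")
print(f"preserved variance: {var_preserved:.3f}")
```

In this sketch the additive version inflates the variance by the noise variance, while the transformed version stays close to the raw variance even though each individual value has been perturbed.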
format Online
Article
Text
id pubmed-9997048
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-9997048 2023-03-09 Data Flush. Shen, Xiaotong; Bi, Xuan; Shen, Rex. Harv Data Sci Rev. Article. 2022 2022-05-09 /pmc/articles/PMC9997048/ /pubmed/36909365 http://dx.doi.org/10.1162/99608f92.681fe3bd Text en https://creativecommons.org/licenses/by/4.0/ The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the author(s) identified above.
title Data Flush
topic Article