Data Flush
Data perturbation is a technique for generating synthetic data by adding “noise” to raw data, which has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that it usually produces synthetic data resulting in information...
Main Authors: | Shen, Xiaotong; Bi, Xuan; Shen, Rex |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997048/ https://www.ncbi.nlm.nih.gov/pubmed/36909365 http://dx.doi.org/10.1162/99608f92.681fe3bd |
_version_ | 1784903179102584832 |
---|---|
author | Shen, Xiaotong Bi, Xuan Shen, Rex |
author_facet | Shen, Xiaotong Bi, Xuan Shen, Rex |
author_sort | Shen, Xiaotong |
collection | PubMed |
description | Data perturbation is a technique for generating synthetic data by adding “noise” to raw data, which has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that it usually produces synthetic data resulting in information loss at the expense of privacy protection. The information loss, in turn, renders the accuracy loss for any statistical or machine learning method based on the synthetic data, weakening downstream analysis and deteriorating in machine learning. In this article, we introduce and advocate a fundamental principle of data perturbation, which requires the preservation of the distribution of raw data. To achieve this, we propose a new scheme, named data flush, which ascertains the validity of the downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating the requirement of strict privacy protection, for instance, differential privacy. We highlight multiple facets of data flush through examples. |
format | Online Article Text |
id | pubmed-9997048 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
record_format | MEDLINE/PubMed |
spelling | pubmed-99970482023-03-09 Data Flush Shen, Xiaotong Bi, Xuan Shen, Rex Harv Data Sci Rev Article Data perturbation is a technique for generating synthetic data by adding “noise” to raw data, which has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that it usually produces synthetic data resulting in information loss at the expense of privacy protection. The information loss, in turn, renders the accuracy loss for any statistical or machine learning method based on the synthetic data, weakening downstream analysis and deteriorating in machine learning. In this article, we introduce and advocate a fundamental principle of data perturbation, which requires the preservation of the distribution of raw data. To achieve this, we propose a new scheme, named data flush, which ascertains the validity of the downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating the requirement of strict privacy protection, for instance, differential privacy. We highlight multiple facets of data flush through examples. 2022 2022-05-09 /pmc/articles/PMC9997048/ /pubmed/36909365 http://dx.doi.org/10.1162/99608f92.681fe3bd Text en https://creativecommons.org/licenses/by/4.0/ The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode (https://creativecommons.org/licenses/by/4.0/)), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the author(s) identified above. |
spellingShingle | Article Shen, Xiaotong Bi, Xuan Shen, Rex Data Flush |
title | Data Flush |
title_full | Data Flush |
title_fullStr | Data Flush |
title_full_unstemmed | Data Flush |
title_short | Data Flush |
title_sort | data flush |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997048/ https://www.ncbi.nlm.nih.gov/pubmed/36909365 http://dx.doi.org/10.1162/99608f92.681fe3bd |
work_keys_str_mv | AT shenxiaotong dataflush AT bixuan dataflush AT shenrex dataflush |