
Scalable Iterative Classification for Sanitizing Large-Scale Datasets

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such...


Bibliographic Details
Main Authors:	Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun; Malin, Bradley
Format:	Online Article Text
Language:	English
Published:	2017
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5607782/ https://www.ncbi.nlm.nih.gov/pubmed/28943741 http://dx.doi.org/10.1109/TKDE.2016.2628180

_version_	1783265336412340224
author	Li, Bo Vorobeychik, Yevgeniy Li, Muqun Malin, Bradley
author_facet	Li, Bo Vorobeychik, Yevgeniy Li, Muqun Malin, Bradley
author_sort	Li, Bo
collection	PubMed
description	Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
format	Online Article Text
id	pubmed-5607782
institution	National Center for Biotechnology Information
language	English
publishDate	2017
record_format	MEDLINE/PubMed
spelling	pubmed-5607782 2018-03-01 Scalable Iterative Classification for Sanitizing Large-Scale Datasets Li, Bo Vorobeychik, Yevgeniy Li, Muqun Malin, Bradley IEEE Trans Knowl Data Eng Article 2017-03-01 2016-11-11 /pmc/articles/PMC5607782/ /pubmed/28943741 http://dx.doi.org/10.1109/TKDE.2016.2628180 Text en https://creativecommons.org/licenses/by/4.0/ Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
spellingShingle	Article Li, Bo Vorobeychik, Yevgeniy Li, Muqun Malin, Bradley Scalable Iterative Classification for Sanitizing Large-Scale Datasets
title	Scalable Iterative Classification for Sanitizing Large-Scale Datasets
title_sort	scalable iterative classification for sanitizing large-scale datasets
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5607782/ https://www.ncbi.nlm.nih.gov/pubmed/28943741 http://dx.doi.org/10.1109/TKDE.2016.2628180
work_keys_str_mv	AT libo scalableiterativeclassificationforsanitizinglargescaledatasets AT vorobeychikyevgeniy scalableiterativeclassificationforsanitizinglargescaledatasets AT limuqun scalableiterativeclassificationforsanitizinglargescaledatasets AT malinbradley scalableiterativeclassificationforsanitizinglargescaledatasets
