Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning

IMPORTANCE: Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are suffic...

Bibliographic Details
Main Authors:	Na, Liangyuan; Yang, Cong; Lo, Chi-Cheng; Zhao, Fangyuan; Fukuoka, Yoshimi; Aswani, Anil
Format:	Online Article Text
Language:	English
Published:	American Medical Association 2018
Subjects:	Original Investigation
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6324329/ https://www.ncbi.nlm.nih.gov/pubmed/30646312 http://dx.doi.org/10.1001/jamanetworkopen.2018.6040

_version_	1783385948735668224
author	Na, Liangyuan; Yang, Cong; Lo, Chi-Cheng; Zhao, Fangyuan; Fukuoka, Yoshimi; Aswani, Anil
author_sort	Na, Liangyuan
collection	PubMed
description	IMPORTANCE: Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are sufficient to ensure privacy. However, no studies, to our knowledge, have been published that demonstrate the possibility or impossibility of reidentifying such activity data.
OBJECTIVE: To evaluate the feasibility of reidentifying accelerometer-measured PA data, which have had geographic and protected health information removed, using support vector machines (SVMs) and random forest methods from machine learning.
DESIGN, SETTING, AND PARTICIPANTS: In this cross-sectional study, the National Health and Nutrition Examination Survey (NHANES) 2003-2004 and 2005-2006 data sets were analyzed in 2018. The accelerometer-measured PA data were collected in a free-living setting for 7 continuous days. NHANES uses a multistage probability sampling design to select a sample that is representative of the civilian noninstitutionalized household (both adult and children) population of the United States.
EXPOSURES: The NHANES data sets contain objectively measured movement intensity as recorded by accelerometers worn during all walking for 1 week.
MAIN OUTCOMES AND MEASURES: The primary outcome was the ability of the random forest and linear SVM algorithms to match demographic and 20-minute aggregated PA data to individual-specific record numbers, and the percentage of correct matches by each machine learning algorithm was the measure.
RESULTS: A total of 4720 adults (mean [SD] age, 40.0 [20.6] years) and 2427 children (mean [SD] age, 12.3 [3.4] years) in NHANES 2003-2004 and 4765 adults (mean [SD] age, 45.2 [19.9] years) and 2539 children (mean [SD] age, 12.1 [3.4] years) in NHANES 2005-2006 were included in the study. The random forest algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4478 adults (94.9%) and 2120 children (87.4%) in NHANES 2003-2004 and 4470 adults (93.8%) and 2172 children (85.5%) in NHANES 2005-2006 (P < .001 for all). The linear SVM algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4043 adults (85.6%) and 1695 children (69.8%) in NHANES 2003-2004 and 4041 adults (84.8%) and 1705 children (67.2%) in NHANES 2005-2006 (P < .001 for all).
CONCLUSIONS AND RELEVANCE: This study suggests that current practices for deidentification of accelerometer-measured PA data might be insufficient to ensure privacy. This finding has important policy implications because it appears to show the need for deidentification that aggregates the PA data of multiple individuals to ensure privacy for single individuals.
format	Online Article Text
id	pubmed-6324329
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	American Medical Association
record_format	MEDLINE/PubMed
spelling	pubmed-6324329 2019-01-22 Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning Na, Liangyuan; Yang, Cong; Lo, Chi-Cheng; Zhao, Fangyuan; Fukuoka, Yoshimi; Aswani, Anil JAMA Netw Open Original Investigation American Medical Association 2018-12-21 /pmc/articles/PMC6324329/ /pubmed/30646312 http://dx.doi.org/10.1001/jamanetworkopen.2018.6040 Text en Copyright 2018 Na L et al. JAMA Network Open. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the CC-BY License.
title	Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning
topic	Original Investigation
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6324329/ https://www.ncbi.nlm.nih.gov/pubmed/30646312 http://dx.doi.org/10.1001/jamanetworkopen.2018.6040
