Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study

BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE: This paper aims...

Bibliographic Details
Main Authors:	Abdalla, Mohamed; Abdalla, Moustafa; Hirst, Graeme; Rudzicz, Frank
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391163/ https://www.ncbi.nlm.nih.gov/pubmed/32673230 http://dx.doi.org/10.2196/18055

author	Abdalla, Mohamed Abdalla, Moustafa Hirst, Graeme Rudzicz, Frank
collection	PubMed
description	BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE: This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS: We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS: We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS: Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
format	Online Article Text
id	pubmed-7391163
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-7391163 2020-08-12 J Med Internet Res Original Paper JMIR Publications 2020-07-15 /pmc/articles/PMC7391163/ /pubmed/32673230 http://dx.doi.org/10.2196/18055 Text en ©Mohamed Abdalla, Moustafa Abdalla, Graeme Hirst, Frank Rudzicz. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 15.07.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study
topic	Original Paper
