
Using word embeddings to improve the privacy of clinical notes


Bibliographic Details
Main Authors:	Abdalla, Mohamed; Abdalla, Moustafa; Rudzicz, Frank; Hirst, Graeme
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Research and Applications
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7309261/ https://www.ncbi.nlm.nih.gov/pubmed/32388549 http://dx.doi.org/10.1093/jamia/ocaa038

_version_	1783549177324634112
author	Abdalla, Mohamed; Abdalla, Moustafa; Rudzicz, Frank; Hirst, Graeme
author_facet	Abdalla, Mohamed; Abdalla, Moustafa; Rudzicz, Frank; Hirst, Graeme
author_sort	Abdalla, Mohamed
collection	PubMed
description	OBJECTIVE: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). MATERIALS AND METHODS: We employ a new “random replacement” paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, unachievable with current “search-and-secure” paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks. RESULTS: We empirically evaluate the effect of our anonymization technique both on upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (ie, achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches. DISCUSSION: As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that the work will serve as a catalyst to further research into alternative deidentification methods that can address current weaknesses. CONCLUSION: Our proposed technique can secure clinical texts at a low cost and extremely high recall with a readability trade-off while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.
format	Online Article Text
id	pubmed-7309261
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7309261 2020-06-29 Using word embeddings to improve the privacy of clinical notes Abdalla, Mohamed Abdalla, Moustafa Rudzicz, Frank Hirst, Graeme J Am Med Inform Assoc Research and Applications
Oxford University Press 2020-05-10 /pmc/articles/PMC7309261/ /pubmed/32388549 http://dx.doi.org/10.1093/jamia/ocaa038 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Research and Applications Abdalla, Mohamed Abdalla, Moustafa Rudzicz, Frank Hirst, Graeme Using word embeddings to improve the privacy of clinical notes
title	Using word embeddings to improve the privacy of clinical notes
title_sort	using word embeddings to improve the privacy of clinical notes
topic	Research and Applications
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7309261/ https://www.ncbi.nlm.nih.gov/pubmed/32388549 http://dx.doi.org/10.1093/jamia/ocaa038
work_keys_str_mv	AT abdallamohamed usingwordembeddingstoimprovetheprivacyofclinicalnotes AT abdallamoustafa usingwordembeddingstoimprovetheprivacyofclinicalnotes AT rudziczfrank usingwordembeddingstoimprovetheprivacyofclinicalnotes AT hirstgraeme usingwordembeddingstoimprovetheprivacyofclinicalnotes
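
The "random replacement" paradigm named in the description field (replacing each token with a neighboring word from the embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy two-dimensional embedding table, the function names, the choice of k, the exclusion of the original word from its own neighbor list, and the "<unk>" placeholder for out-of-vocabulary tokens are all assumptions made for illustration.

```python
import math
import random

# Toy 2-D embedding table. The words and values are hypothetical and exist
# only to make the example runnable; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "mother": (0.90, 0.10),
    "father": (0.88, 0.15),
    "sister": (0.85, 0.20),
    "diabetes": (0.10, 0.90),
    "hypertension": (0.15, 0.88),
    "asthma": (0.20, 0.85),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(token, embeddings, k):
    """Return the k vocabulary words most similar to token, excluding token itself."""
    target = embeddings[token]
    others = [word for word in embeddings if word != token]
    others.sort(key=lambda word: cosine(target, embeddings[word]), reverse=True)
    return others[:k]


def random_replace(tokens, embeddings, k=2, rng=None, unknown="<unk>"):
    """Replace every token with a random one of its k nearest neighbors.

    Assumption: tokens missing from the vocabulary are mapped to a placeholder
    so that no original token survives in the output.
    """
    rng = rng or random.Random()
    replaced = []
    for token in tokens:
        if token in embeddings:
            replaced.append(rng.choice(nearest_neighbors(token, embeddings, k)))
        else:
            replaced.append(unknown)
    return replaced


if __name__ == "__main__":
    note = ["mother", "diabetes", "smith"]
    print(random_replace(note, TOY_EMBEDDINGS, k=2, rng=random.Random(0)))
```

Because every token is replaced regardless of whether it looks sensitive, no detection step can miss an identifier, which is the intuition behind the 100% recall claim in the description. The output is less readable than the input, matching the readability trade-off the abstract mentions.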
