
Deidentification of free-text medical records using pre-trained bidirectional transformers

The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification,...


Bibliographic Details
Main authors: Johnson, Alistair E. W.; Bulgarelli, Lucas; Pollard, Tom J.
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8330601/
https://www.ncbi.nlm.nih.gov/pubmed/34350426
http://dx.doi.org/10.1145/3368555.3384455
collection PubMed
description The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are countless variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice. In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source, allowing for broad reuse.
format Online
Article
Text
id pubmed-8330601
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-8330601 2021-08-03 Deidentification of free-text medical records using pre-trained bidirectional transformers Johnson, Alistair E. W. Bulgarelli, Lucas Pollard, Tom J. Proc ACM Conf Health Inference Learn (2020) Article
2020-04-02 2020-04 /pmc/articles/PMC8330601/ /pubmed/34350426 http://dx.doi.org/10.1145/3368555.3384455 Text en This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/).
topic Article