A study of deep learning methods for de-identification of clinical notes in cross-institute settings

BACKGROUND: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great efforts in developing methods and corpora for de-identification...

Bibliographic Details
Main authors: Yang, Xi; Lyu, Tianchen; Li, Qian; Lee, Chih-Yin; Bian, Jiang; Hogan, William R.; Wu, Yonghui
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894104/
https://www.ncbi.nlm.nih.gov/pubmed/31801524
http://dx.doi.org/10.1186/s12911-019-0935-4
_version_ 1783476323553902592
author Yang, Xi
Lyu, Tianchen
Li, Qian
Lee, Chih-Yin
Bian, Jiang
Hogan, William R.
Wu, Yonghui
author_sort Yang, Xi
collection PubMed
description BACKGROUND: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often used training and test data collected from the same institution. Few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions. METHODS: We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated the performance using the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies to customize the deep learning models using UF notes and resources. RESULTS: Word embeddings pre-trained on a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features could further improve the performance of de-identification in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively. CONCLUSIONS: It is necessary to customize de-identification models using local clinical text and other resources when they are applied in cross-institute settings. Fine-tuning is a potential solution to re-use pre-trained parameters and reduce the training time needed to customize deep learning-based de-identification models trained using a clinical corpus from a different institution.
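The abstract reports both strict and relaxed F1 scores without defining them in this record. The following is a minimal sketch of how such span-level scores are commonly computed, under the assumption that "strict" means an exact match of offsets and PHI type and "relaxed" means any character overlap with the same PHI type. The function names, the span layout, and the toy data are hypothetical and are not taken from the paper or from the i2b2 evaluation script.

```python
from typing import List, Tuple

# A span is (start_offset, end_offset, phi_type); end is exclusive.
# This layout is an assumption made for illustration only.
Span = Tuple[int, int, str]


def _exact(a: Span, b: Span) -> bool:
    """Strict criterion: identical offsets and identical PHI type."""
    return a == b


def _overlap(a: Span, b: Span) -> bool:
    """Assumed relaxed criterion: same PHI type and overlapping offsets."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], relaxed: bool = False) -> float:
    """Return span-level F1 under the strict or the relaxed criterion."""
    if not gold or not pred:
        return 0.0
    match = _overlap if relaxed else _exact
    # Precision counts predicted spans that match some gold span;
    # recall counts gold spans that are matched by some predicted span.
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy annotations: one exact hit, one partial overlap, one miss.
    gold = [(0, 10, "NAME"), (20, 30, "DATE"), (40, 52, "HOSPITAL")]
    pred = [(0, 10, "NAME"), (20, 28, "DATE"), (60, 65, "PHONE")]
    print("strict F1 :", round(span_f1(gold, pred), 4))
    print("relaxed F1:", round(span_f1(gold, pred, relaxed=True), 4))
```

In this toy case the partially overlapping DATE span counts only under the relaxed criterion, which is why a relaxed score is never lower than the corresponding strict score.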
format Online
Article
Text
id pubmed-6894104
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6894104 2019-12-11 A study of deep learning methods for de-identification of clinical notes in cross-institute settings Yang, Xi Lyu, Tianchen Li, Qian Lee, Chih-Yin Bian, Jiang Hogan, William R. Wu, Yonghui BMC Med Inform Decis Mak Research BioMed Central 2019-12-05 /pmc/articles/PMC6894104/ /pubmed/31801524 http://dx.doi.org/10.1186/s12911-019-0935-4 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A study of deep learning methods for de-identification of clinical notes in cross-institute settings
title_sort study of deep learning methods for de-identification of clinical notes in cross-institute settings
topic Research