
An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

BACKGROUND: With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct usage of this data may...


Bibliographic Details
Main Authors:	Wang, Peng; Li, Yong; Yang, Liang; Li, Simin; Li, Linfeng; Zhao, Zehan; Long, Shaopei; Wang, Fei; Wang, Hongqian; Li, Ying; Wang, Chengliang
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472063/ https://www.ncbi.nlm.nih.gov/pubmed/36040774 http://dx.doi.org/10.2196/38154

_version_	1784789225269362688
author	Wang, Peng; Li, Yong; Yang, Liang; Li, Simin; Li, Linfeng; Zhao, Zehan; Long, Shaopei; Wang, Fei; Wang, Hongqian; Li, Ying; Wang, Chengliang
author_sort	Wang, Peng
collection	PubMed
description	BACKGROUND: With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct usage of this data may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Existing rule-based, machine learning–based, or deep learning–based methods have been proposed to solve this problem. However, these methods still face the difficulties of insufficient Chinese electronic health record data and the complex features of the Chinese language. OBJECTIVE: This paper proposes a method to overcome the difficulties of overfitting and a lack of training data for deep neural networks to enable Chinese protected health information deidentification. METHODS: We propose a new model that merges TinyBERT (bidirectional encoder representations from transformers) as a text feature extraction module and the conditional random field method as a prediction module for deidentifying protected health information in Chinese medical electronic health records. In addition, a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy is proposed for overcoming insufficient Chinese electronic health records. RESULTS: We compare our method with 5 baseline methods that utilize different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method had better performance (microprecision: 98.7%, microrecall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods. 
CONCLUSIONS: Compared to baseline methods, the efficiency advantage of TinyBERT on our proposed augmented data set was kept while the performance improved for the task of Chinese protected health information deidentification.
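The mention-replacement strategy named in the METHODS portion of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes character-level tokens with BIO tags, and the label names (`NAME`, `LOC`), the replacement lexicon, and the example sentence are all hypothetical and synthetic. The idea shown is only the general technique: each tagged mention is swapped for another surface form of the same entity type, and the tags are regenerated to match the new length.

```python
import random


def extract_mentions(tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def replace_mentions(tokens, tags, lexicon, rng):
    """Swap each mention for a random same-type entry from `lexicon`.

    `lexicon` maps an entity type to a list of non-empty strings
    (a hypothetical resource, assumed here for illustration).
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, kind in extract_mentions(tags):
        # Copy the untouched context before this mention.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        candidates = lexicon.get(kind)
        # Character-level tokens: split the replacement string into characters.
        mention = list(rng.choice(candidates)) if candidates else tokens[start:end]
        new_tokens += mention
        new_tags += [f"B-{kind}"] + [f"I-{kind}"] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Synthetic example sentence, not taken from any real record.
tokens = list("患者张三住北京")
tags = ["O", "O", "B-NAME", "I-NAME", "O", "B-LOC", "I-LOC"]
lexicon = {"NAME": ["李四", "王小明"], "LOC": ["上海", "重庆市"]}

aug_tokens, aug_tags = replace_mentions(tokens, tags, lexicon, random.Random(0))
print("".join(aug_tokens), aug_tags)
```

Because replacements may differ in length from the original mention, the tag sequence is rebuilt rather than copied, which keeps tokens and tags aligned for a downstream sequence labeler.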
format	Online Article Text
id	pubmed-9472063
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-9472063 2022-09-15 JMIR Med Inform Original Paper JMIR Publications 2022-08-30 /pmc/articles/PMC9472063/ /pubmed/36040774 http://dx.doi.org/10.2196/38154 Text en ©Peng Wang, Yong Li, Liang Yang, Simin Li, Linfeng Li, Zehan Zhao, Shaopei Long, Fei Wang, Hongqian Wang, Ying Li, Chengliang Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 30.08.2022. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
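The micro-averaged scores reported in the abstract (precision 98.7%, recall 99.13%, F1 98.91%) can be checked for internal consistency, since F1 is the harmonic mean of precision and recall. The sketch below is a generic illustration of micro-averaging, not code from the paper; the per-type counts in `toy_counts` are invented purely to show the pooling step.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts):
    """Micro-average: pool TP/FP/FN over all entity types, then score once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check using the percentages stated in the abstract.
reported_f1 = round(f1_score(98.7, 99.13), 2)
print(reported_f1)  # 98.91

# Invented counts for two hypothetical entity types, for illustration only.
toy_counts = {
    "NAME": {"tp": 8, "fp": 2, "fn": 0},
    "LOC": {"tp": 2, "fp": 0, "fn": 2},
}
p, r, f = micro_prf(toy_counts)
print(p, r, f)
```

Micro-averaging weights every predicted entity equally, so frequent entity types dominate the score; a macro average would instead score each type separately and then take the mean.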
title	An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472063/ https://www.ncbi.nlm.nih.gov/pubmed/36040774 http://dx.doi.org/10.2196/38154
