De-identifying free text of Japanese electronic health records

BACKGROUND: Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, au...

Bibliographic Details
Main authors:	Kajiyama, Kohei; Horiguchi, Hiromasa; Okumura, Takashi; Morita, Mizuki; Kano, Yoshinobu
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7504663/ https://www.ncbi.nlm.nih.gov/pubmed/32958039 http://dx.doi.org/10.1186/s13326-020-00227-9

author	Kajiyama, Kohei; Horiguchi, Hiromasa; Okumura, Takashi; Morita, Mizuki; Kano, Yoshinobu
author_sort	Kajiyama, Kohei
collection	PubMed
description	BACKGROUND: Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.
RESULTS: Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset of fictitious records written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method was the best performing, except for the MedNLP dataset, for which the rule-based method was best. The LSTM-based method still achieved a good score of 83.07 points on the MedNLP dataset, 1.16 points below the best score, which was obtained using the rule-based method. The results suggest that LSTM adapted well to the different characteristics of our datasets. Our LSTM-based method outperformed our CRF-based method by 7.41 F1 points when applied to our Pathology Report dataset. This report is the first study to apply this LSTM-based method to any de-identification task on Japanese EHRs.
CONCLUSIONS: Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.
format	Online Article Text
id	pubmed-7504663
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7504663 2020-09-23 De-identifying free text of Japanese electronic health records Kajiyama, Kohei Horiguchi, Hiromasa Okumura, Takashi Morita, Mizuki Kano, Yoshinobu J Biomed Semantics Research BioMed Central 2020-09-21 /pmc/articles/PMC7504663/ /pubmed/32958039 http://dx.doi.org/10.1186/s13326-020-00227-9 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	De-identifying free text of Japanese electronic health records
topic	Research
