Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model
OBJECTIVES: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition.
Main authors: | Oh, Seo Hyun; Kang, Min; Lee, Youngho |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Korean Society of Medical Informatics, 2022 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8850174/ https://www.ncbi.nlm.nih.gov/pubmed/35172087 http://dx.doi.org/10.4258/hir.2022.28.1.16 |
_version_ | 1784652534787342336 |
---|---|
author | Oh, Seo Hyun; Kang, Min; Lee, Youngho |
author_facet | Oh, Seo Hyun; Kang, Min; Lee, Youngho |
author_sort | Oh, Seo Hyun |
collection | PubMed |
description | OBJECTIVES: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition. METHODS: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used the three pre-training models—namely, bidirectional encoder representations from transformers (BERT), robustly optimized BERT pre-training approach (RoBERTa), and XLNet (model built based on Transformer-XL)—to detect PHI. After the dataset was tokenized, it was processed using an inside-outside-beginning tagging scheme and WordPiece-tokenized to place it into these models. Further, the PHI recognition performance was investigated using BERT, RoBERTa, and XLNet. RESULTS: Comparing the PHI recognition performance of the three models, it was confirmed that XLNet had a superior F1-score of 96.29%. In addition, when checking PHI entity performance evaluation, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT. CONCLUSIONS: Among the pre-training models used in this study, XLNet exhibited superior performance because word embedding was well constructed using the two-stream self-attention method. In addition, compared to BERT, RoBERTa and XLNet showed superior performance, indicating that they were more effective in grasping the context. |
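The abstract describes preparing the i2b2 2014 data by applying an inside-outside-beginning (IOB) tagging scheme and then WordPiece-tokenizing the text before feeding it to the models. The sketch below is not the authors' code; it is a minimal illustration, using the Hugging Face `transformers` tokenizer, of how word-level IOB tags are typically aligned to WordPiece subwords. The example sentence and the `B-NAME`/`B-HOSPITAL` labels are illustrative, not the i2b2 2014 tag set.

```python
# Minimal sketch of IOB-to-WordPiece label alignment (assumed setup, not the paper's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["John", "Smith", "visited", "Boston", "Hospital", "."]
iob_tags = ["B-NAME", "I-NAME", "O", "B-HOSPITAL", "I-HOSPITAL", "O"]  # illustrative labels

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:                   # special tokens such as [CLS]/[SEP]
        aligned_labels.append("IGN")      # ignored in the loss (usually encoded as -100)
    elif word_id != previous_word_id:     # first subword keeps the word-level tag
        aligned_labels.append(iob_tags[word_id])
    else:                                 # later subwords of the same word inherit an I- tag
        tag = iob_tags[word_id]
        aligned_labels.append(tag if tag == "O" else "I-" + tag.split("-", 1)[1])
    previous_word_id = word_id

for token, label in zip(encoding.tokens(), aligned_labels):
    print(f"{token:12s} {label}")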
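Once the labels are aligned, the paper fine-tunes BERT, RoBERTa, and XLNet for token classification and compares entity-level F1-scores. The following is a hedged sketch of that kind of setup; the model name, label list, toy batch, and training loop stand-in are assumptions for illustration, and a real run would iterate over the tokenized i2b2 training set with a DataLoader and mask special tokens with `-100`.

```python
# Fine-tuning and evaluation sketch (assumed setup, not the paper's code).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from seqeval.metrics import f1_score

label_list = ["O", "B-NAME", "I-NAME", "B-HOSPITAL", "I-HOSPITAL"]   # illustrative tag set
model_name = "xlnet-base-cased"   # swap for "bert-base-cased" or "roberta-base" to compare

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

# One optimization step on a toy batch; a real run would loop over the i2b2 2014 training data.
batch = tokenizer(["John Smith visited Boston Hospital ."], return_tensors="pt", truncation=True)
labels = torch.zeros_like(batch["input_ids"])   # all "O" for the toy batch; real runs use -100 for specials
outputs = model(**batch, labels=labels)
outputs.loss.backward()

# Entity-level F1 with seqeval, the usual metric for IOB-tagged NER outputs.
gold = [["B-NAME", "I-NAME", "O", "B-HOSPITAL", "I-HOSPITAL", "O"]]
pred = [["B-NAME", "I-NAME", "O", "B-HOSPITAL", "O", "O"]]
print(f"entity-level F1: {f1_score(gold, pred):.4f}")
```

Swapping `model_name` among the three checkpoints while keeping the rest of the pipeline fixed is the comparison the abstract reports, with XLNet reaching the highest F1-score (96.29%).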
format | Online Article Text |
id | pubmed-8850174 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Korean Society of Medical Informatics |
record_format | MEDLINE/PubMed |
spelling | pubmed-8850174 2022-02-26 Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model Oh, Seo Hyun; Kang, Min; Lee, Youngho Healthc Inform Res Original Article OBJECTIVES: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition. METHODS: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used the three pre-training models—namely, bidirectional encoder representations from transformers (BERT), robustly optimized BERT pre-training approach (RoBERTa), and XLNet (model built based on Transformer-XL)—to detect PHI. After the dataset was tokenized, it was processed using an inside-outside-beginning tagging scheme and WordPiece-tokenized to place it into these models. Further, the PHI recognition performance was investigated using BERT, RoBERTa, and XLNet. RESULTS: Comparing the PHI recognition performance of the three models, it was confirmed that XLNet had a superior F1-score of 96.29%. In addition, when checking PHI entity performance evaluation, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT. CONCLUSIONS: Among the pre-training models used in this study, XLNet exhibited superior performance because word embedding was well constructed using the two-stream self-attention method. In addition, compared to BERT, RoBERTa and XLNet showed superior performance, indicating that they were more effective in grasping the context. Korean Society of Medical Informatics 2022-01 2022-01-31 /pmc/articles/PMC8850174/ /pubmed/35172087 http://dx.doi.org/10.4258/hir.2022.28.1.16 Text en © 2022 The Korean Society of Medical Informatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article; Oh, Seo Hyun; Kang, Min; Lee, Youngho; Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title | Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title_full | Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title_fullStr | Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title_full_unstemmed | Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title_short | Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model |
title_sort | protected health information recognition by fine-tuning a pre-training transformer model |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8850174/ https://www.ncbi.nlm.nih.gov/pubmed/35172087 http://dx.doi.org/10.4258/hir.2022.28.1.16 |
work_keys_str_mv | AT ohseohyun protectedhealthinformationrecognitionbyfinetuningapretrainingtransformermodel AT kangmin protectedhealthinformationrecognitionbyfinetuningapretrainingtransformermodel AT leeyoungho protectedhealthinformationrecognitionbyfinetuningapretrainingtransformermodel |