
Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model


Bibliographic Details
Main Authors:	Oh, Seo Hyun, Kang, Min, Lee, Youngho
Format:	Online Article Text
Language:	English
Published:	Korean Society of Medical Informatics 2022
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8850174/ https://www.ncbi.nlm.nih.gov/pubmed/35172087 http://dx.doi.org/10.4258/hir.2022.28.1.16

Description
Summary:	OBJECTIVES: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-trained models that have recently attracted significant attention and to determine which model is most suitable for PHI recognition. METHODS: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used three pre-trained models to detect PHI: bidirectional encoder representations from transformers (BERT), the robustly optimized BERT pre-training approach (RoBERTa), and XLNet (a model built on Transformer-XL). After the dataset was tokenized, it was labeled using an inside-outside-beginning tagging scheme and WordPiece-tokenized for input to these models. The PHI recognition performance of BERT, RoBERTa, and XLNet was then evaluated. RESULTS: Of the three models, XLNet had the best F1-score, 96.29%. In addition, in the evaluation by PHI entity type, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT. CONCLUSIONS: Among the pre-trained models used in this study, XLNet exhibited superior performance because its word embeddings were well constructed using the two-stream self-attention method. In addition, compared to BERT, RoBERTa and XLNet showed superior performance, indicating that they were more effective in grasping the context.
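
The preprocessing step described in METHODS (inside-outside-beginning tagging followed by subword tokenization) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the example sentence, the entity labels, and the toy subword splitter are all assumptions made for illustration; a real pipeline would use the tokenizer that ships with each model, and its subword boundaries would differ.

```python
from typing import List, Tuple


def iob_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with IOB labels.

    Each span is (start, end_exclusive, label) in token indices (assumed format).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def toy_subword_split(token: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Cuts a token into chunks of max_len characters and marks continuation
    pieces with '##'. It only mimics the shape of subword output.
    """
    pieces = [token[i:i + max_len] for i in range(0, len(token), max_len)] or [token]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_tags(tokens: List[str], tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split tokens into subword pieces and carry the IOB tags along.

    The first piece keeps the token's tag; continuation pieces of a B-X token
    become I-X so that each entity still has exactly one B- tag.
    """
    pieces: List[str] = []
    piece_tags: List[str] = []
    for token, tag in zip(tokens, tags):
        sub = toy_subword_split(token)
        continuation = "I-" + tag[2:] if tag.startswith("B-") else tag
        pieces.extend(sub)
        piece_tags.extend([tag] + [continuation] * (len(sub) - 1))
    return pieces, piece_tags


# Illustrative sentence and spans (invented, not taken from the i2b2 2014 data).
tokens = ["Seen", "by", "Dr.", "Alexander", "Smith", "on", "2014-03-02"]
spans = [(3, 5, "DOCTOR"), (6, 7, "DATE")]

tags = iob_tags(tokens, spans)
pieces, piece_tags = align_tags(tokens, tags)
for piece, tag in zip(pieces, piece_tags):
    print(f"{piece}\t{tag}")
```

Propagating the tag to every continuation piece is one common design choice; another is to label only the first piece of each token and ignore the rest when computing the loss. The abstract does not say which approach the study used.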
