Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study


Bibliographic Details
Main authors: Xiong, Ying, Chen, Shuai, Chen, Qingcai, Yan, Jun, Tang, Buzhou
Format: Online Article Text
Language: English
Published: JMIR Publications 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803475/
https://www.ncbi.nlm.nih.gov/pubmed/33372664
http://dx.doi.org/10.2196/23357
author Xiong, Ying
Chen, Shuai
Chen, Qingcai
Yan, Jun
Tang, Buzhou
author_sort Xiong, Ying
collection PubMed
description BACKGROUND: The widespread adoption of electronic health records (EHRs) has improved the quality of health care. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, which lowers the quality of EHR content. To minimize data redundancy across documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity between clinical text snippets.
OBJECTIVE: In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results.
METHODS: We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules that encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts the pretrained language model bidirectional encoder representation from transformers (BERT) to encode clinical text snippet pairs; and (3) an entity-level representation module to model the clinical entity information in clinical text snippets. For the entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II).
RESULTS: We conducted experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge to evaluate model performance. The model using only BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. When the character-level representation or the entity-level representation was added individually, the PCC increased to 0.857 (character level) and to 0.854 (entity I) or 0.859 (entity II). When both the character-level and the entity-level representations were added, the PCC increased further to 0.861 (entity I) and 0.868 (entity II).
CONCLUSIONS: Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.
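The results above are reported as Pearson correlation coefficients between model scores and reference scores. As a reading aid, the following is a minimal sketch of how such a coefficient can be computed. It is not the authors' evaluation code; the function name `pearson_correlation` and the example scores are hypothetical and chosen only for illustration.

```python
import math


def pearson_correlation(predicted, gold):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two score pairs")

    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n

    # Sum of products of deviations (proportional to the covariance).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    # Sums of squared deviations (proportional to the variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)

    if var_p == 0 or var_g == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_p * var_g)


# Made-up similarity scores for five snippet pairs (NOT data from the study).
gold_scores = [0.0, 1.5, 3.0, 4.5, 5.0]
model_scores = [0.4, 1.2, 3.3, 4.1, 4.9]
pcc = pearson_correlation(model_scores, gold_scores)
print(f"PCC = {pcc:.3f}")
```

The coefficient depends only on how closely the two score lists follow a linear relationship, so a model whose scores are systematically shifted or rescaled relative to the reference can still reach a high PCC.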
format Online
Article
Text
id pubmed-7803475
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-7803475 2021-01-15 JMIR Med Inform Original Paper JMIR Publications 2020-12-29 /pmc/articles/PMC7803475/ /pubmed/33372664 http://dx.doi.org/10.2196/23357 Text en ©Ying Xiong, Shuai Chen, Qingcai Chen, Jun Yan, Buzhou Tang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.12.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
title Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study
title_sort using character-level and entity-level representations to enhance bidirectional encoder representation from transformers-based clinical semantic textual similarity model: clinicalsts modeling study
topic Original Paper