Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study
BACKGROUND: The widespread adoption of electronic health records (EHRs) has improved the quality of health care. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, which results in records with low-quality content. To minimize data red...
Main Authors: | Xiong, Ying; Chen, Shuai; Chen, Qingcai; Yan, Jun; Tang, Buzhou |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803475/ https://www.ncbi.nlm.nih.gov/pubmed/33372664 http://dx.doi.org/10.2196/23357 |
_version_ | 1783635945040379904 |
---|---|
author | Xiong, Ying Chen, Shuai Chen, Qingcai Yan, Jun Tang, Buzhou |
author_facet | Xiong, Ying Chen, Shuai Chen, Qingcai Yan, Jun Tang, Buzhou |
author_sort | Xiong, Ying |
collection | PubMed |
description | BACKGROUND: The widespread adoption of electronic health records (EHRs) has improved the quality of health care. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, which results in records with low-quality content. To minimize data redundancy across documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity between clinical text snippets. OBJECTIVE: In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. METHODS: We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts a pretrained language model, bidirectional encoder representation from transformers (BERT), to encode clinical text snippet pairs; and (3) an entity-level representation module to model clinical entity information in clinical text snippets. For the entity-level representation, we compare 2 methods: one encodes entities by the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). RESULTS: We conducted experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge to evaluate model performance. The model using only BERT to encode text snippet pairs achieved a Pearson correlation coefficient (PCC) of 0.848. When character-level representation and entity-level representation were individually added to our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation were added to our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). CONCLUSIONS: Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model. |
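The abstract describes the architecture only at a high level. The following is a minimal, illustrative PyTorch sketch (not the authors' released code) of how the three representation levels could be combined for score regression: a character-level CNN, a BERT sentence encoder, and an embedding of entity-type labels (the "entity I" variant; the MeSH-based "entity II" embeddings are not sketched here). The checkpoint name, hidden sizes, character vocabulary size, entity-type inventory, and the 0-5 similarity scale are assumptions for illustration only.

```python
# Illustrative sketch of a BERT + character-level + entity-level STS regressor.
# Assumptions (not taken from the paper): generic "bert-base-uncased" checkpoint,
# arbitrary hidden sizes, and a 0-5 similarity scale as commonly used in STS tasks.
import torch
import torch.nn as nn
from transformers import BertModel


class CharCNN(nn.Module):
    """Character-level encoder: embed characters, apply a 1D CNN, max-pool over time."""

    def __init__(self, n_chars=128, char_dim=32, n_filters=64, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):                      # (batch, n_chars_in_pair)
        x = self.embed(char_ids).transpose(1, 2)      # (batch, char_dim, length)
        x = torch.relu(self.conv(x))                  # (batch, n_filters, length)
        return x.max(dim=2).values                    # (batch, n_filters)


class ClinicalSTSModel(nn.Module):
    """Concatenate BERT pooled output, character-level, and entity-level vectors,
    then regress a similarity score in [0, 5]."""

    def __init__(self, n_entity_types=50, entity_dim=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # stand-in checkpoint
        self.char_cnn = CharCNN()
        self.entity_embed = nn.Embedding(n_entity_types, entity_dim, padding_idx=0)
        hidden = self.bert.config.hidden_size + 64 + entity_dim
        self.regressor = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, input_ids, attention_mask, char_ids, entity_type_ids):
        # Sentence-level: pooled representation of the snippet pair.
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        # Character-level: CNN over the character sequence of the pair.
        char_vec = self.char_cnn(char_ids)
        # Entity-level ("entity I"): average of entity-type label embeddings
        # (simplified; padding positions are included in the mean here).
        ent_vec = self.entity_embed(entity_type_ids).mean(dim=1)
        features = torch.cat([cls, char_vec, ent_vec], dim=-1)
        return 5.0 * torch.sigmoid(self.regressor(features)).squeeze(-1)


def pearson_cc(pred, gold):
    """Pearson correlation coefficient between predicted and gold scores
    (the evaluation metric reported in the abstract)."""
    pred = torch.as_tensor(pred, dtype=torch.float)
    gold = torch.as_tensor(gold, dtype=torch.float)
    pred_c, gold_c = pred - pred.mean(), gold - gold.mean()
    return (pred_c * gold_c).sum() / (pred_c.norm() * gold_c.norm())
```

The design choice mirrored here is simple late fusion: each module produces a fixed-size vector, the vectors are concatenated, and a small feed-forward head regresses the similarity score; swapping in MeSH-derived entity embeddings ("entity II") would only change how `ent_vec` is produced.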
format | Online Article Text |
id | pubmed-7803475 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-7803475 2021-01-15 Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study Xiong, Ying Chen, Shuai Chen, Qingcai Yan, Jun Tang, Buzhou JMIR Med Inform Original Paper BACKGROUND: The widespread adoption of electronic health records (EHRs) has improved the quality of health care. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, which results in records with low-quality content. To minimize data redundancy across documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity between clinical text snippets. OBJECTIVE: In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. METHODS: We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts a pretrained language model, bidirectional encoder representation from transformers (BERT), to encode clinical text snippet pairs; and (3) an entity-level representation module to model clinical entity information in clinical text snippets. For the entity-level representation, we compare 2 methods: one encodes entities by the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). RESULTS: We conducted experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge to evaluate model performance. The model using only BERT to encode text snippet pairs achieved a Pearson correlation coefficient (PCC) of 0.848. When character-level representation and entity-level representation were individually added to our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation were added to our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). CONCLUSIONS: Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model. JMIR Publications 2020-12-29 /pmc/articles/PMC7803475/ /pubmed/33372664 http://dx.doi.org/10.2196/23357 Text en ©Ying Xiong, Shuai Chen, Qingcai Chen, Jun Yan, Buzhou Tang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.12.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
|
spellingShingle | Original Paper Xiong, Ying Chen, Shuai Chen, Qingcai Yan, Jun Tang, Buzhou Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title | Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title_full | Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title_fullStr | Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title_full_unstemmed | Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title_short | Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study |
title_sort | using character-level and entity-level representations to enhance bidirectional encoder representation from transformers-based clinical semantic textual similarity model: clinicalsts modeling study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803475/ https://www.ncbi.nlm.nih.gov/pubmed/33372664 http://dx.doi.org/10.2196/23357 |
work_keys_str_mv | AT xiongying usingcharacterlevelandentitylevelrepresentationstoenhancebidirectionalencoderrepresentationfromtransformersbasedclinicalsemantictextualsimilaritymodelclinicalstsmodelingstudy AT chenshuai usingcharacterlevelandentitylevelrepresentationstoenhancebidirectionalencoderrepresentationfromtransformersbasedclinicalsemantictextualsimilaritymodelclinicalstsmodelingstudy AT chenqingcai usingcharacterlevelandentitylevelrepresentationstoenhancebidirectionalencoderrepresentationfromtransformersbasedclinicalsemantictextualsimilaritymodelclinicalstsmodelingstudy AT yanjun usingcharacterlevelandentitylevelrepresentationstoenhancebidirectionalencoderrepresentationfromtransformersbasedclinicalsemantictextualsimilaritymodelclinicalstsmodelingstudy AT tangbuzhou usingcharacterlevelandentitylevelrepresentationstoenhancebidirectionalencoderrepresentationfromtransformersbasedclinicalsemantictextualsimilaritymodelclinicalstsmodelingstudy |