Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison
Main Authors: | Chang, David; Lin, Eric; Brandt, Cynthia; Taylor, Richard Andrew |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications, 2021 |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8665398/ https://www.ncbi.nlm.nih.gov/pubmed/34842531 http://dx.doi.org/10.2196/23101 |
_version_ | 1784614000914333696 |
---|---|
author | Chang, David Lin, Eric Brandt, Cynthia Taylor, Richard Andrew |
author_facet | Chang, David Lin, Eric Brandt, Cynthia Taylor, Richard Andrew |
author_sort | Chang, David |
collection | PubMed |
description | BACKGROUND: Although electronic health record systems have facilitated clinical documentation in health care, they have also introduced new challenges, such as the proliferation of redundant information through the use of copy and paste commands or templates. One approach to trimming down bloated clinical documentation and improving clinical summarization is to identify highly similar text snippets with the goal of removing such text. OBJECTIVE: We developed a natural language processing system for the task of assessing clinical semantic textual similarity. The system assigns scores to pairs of clinical text snippets based on their clinical semantic similarity. METHODS: We leveraged recent advances in natural language processing and graph representation learning to create a model that combines linguistic and domain knowledge information from the MedSTS data set to assess clinical semantic textual similarity. We used bidirectional encoder representation from transformers (BERT)–based models as text encoders for the sentence pairs in the data set and graph convolutional networks (GCNs) as graph encoders for corresponding concept graphs that were constructed based on the sentences. We also explored techniques, including data augmentation, ensembling, and knowledge distillation, to improve the model’s performance, as measured by the Pearson correlation coefficient (r). RESULTS: Fine-tuning the BERT_base and ClinicalBERT models on the MedSTS data set provided a strong baseline (Pearson correlation coefficients: 0.842 and 0.848, respectively) compared to those of the previous year’s submissions. Our data augmentation techniques yielded moderate gains in performance, and adding a GCN-based graph encoder to incorporate the concept graphs also boosted performance, especially when the node features were initialized with pretrained knowledge graph embeddings of the concepts (r=0.868). As expected, ensembling improved performance, and performing multisource ensembling by using different language model variants, conducting knowledge distillation with the multisource ensemble model, and taking a final ensemble of the distilled models further improved the system’s performance (Pearson correlation coefficients: 0.875, 0.878, and 0.882, respectively). CONCLUSIONS: This study presents a system for the MedSTS clinical semantic textual similarity benchmark task, which was created by combining BERT-based text encoders and GCN-based graph encoders in order to incorporate domain knowledge into the natural language processing pipeline. We also experimented with other techniques involving data augmentation, pretrained concept embeddings, ensembling, and knowledge distillation to further increase our system’s performance. Although the task and its benchmark data set are in the early stages of development, this study, as well as the results of the competition, demonstrates the potential of modern language model–based systems to detect redundant information in clinical notes. |
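The abstract reports all results as Pearson correlation coefficients (r) between the system's predicted similarity scores and the gold-standard annotations. As a minimal illustration of how that metric is computed (a plain-Python sketch, not the authors' code — the function name and sample scores are made up for illustration):

```python
def pearson_r(preds, golds):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_g = sum(golds) / n
    # Covariance of the two score lists (unnormalized).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds))
    # Standard deviations (unnormalized); the shared n factors cancel in the ratio.
    sd_p = sum((p - mean_p) ** 2 for p in preds) ** 0.5
    sd_g = sum((g - mean_g) ** 2 for g in golds) ** 0.5
    return cov / (sd_p * sd_g)

# Perfectly linearly related scores give r ≈ 1.0.
print(pearson_r([0.5, 1.0, 3.5, 4.0], [1.0, 2.0, 7.0, 8.0]))
```

A score of 0.882, as reported for the final distilled ensemble, thus indicates a strong linear agreement between system predictions and human similarity judgments.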
format | Online Article Text |
id | pubmed-8665398 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-8665398 2021-12-30 Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison Chang, David Lin, Eric Brandt, Cynthia Taylor, Richard Andrew JMIR Med Inform Original Paper JMIR Publications 2021-11-26 /pmc/articles/PMC8665398/ /pubmed/34842531 http://dx.doi.org/10.2196/23101 Text en ©David Chang, Eric Lin, Cynthia Brandt, Richard Andrew Taylor. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 26.11.2021. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Chang, David Lin, Eric Brandt, Cynthia Taylor, Richard Andrew Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title | Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title_full | Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title_fullStr | Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title_full_unstemmed | Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title_short | Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison |
title_sort | incorporating domain knowledge into language models by using graph convolutional networks for assessing semantic textual similarity: model development and performance comparison |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8665398/ https://www.ncbi.nlm.nih.gov/pubmed/34842531 http://dx.doi.org/10.2196/23101 |
work_keys_str_mv | AT changdavid incorporatingdomainknowledgeintolanguagemodelsbyusinggraphconvolutionalnetworksforassessingsemantictextualsimilaritymodeldevelopmentandperformancecomparison AT lineric incorporatingdomainknowledgeintolanguagemodelsbyusinggraphconvolutionalnetworksforassessingsemantictextualsimilaritymodeldevelopmentandperformancecomparison AT brandtcynthia incorporatingdomainknowledgeintolanguagemodelsbyusinggraphconvolutionalnetworksforassessingsemantictextualsimilaritymodeldevelopmentandperformancecomparison AT taylorrichardandrew incorporatingdomainknowledgeintolanguagemodelsbyusinggraphconvolutionalnetworksforassessingsemantictextualsimilaritymodeldevelopmentandperformancecomparison |