Adapting Bidirectional Encoder Representations from Transformers (BERT) to Assess Clinical Semantic Textual Similarity: Algorithm Development and Validation Study
BACKGROUND: Natural Language Understanding enables automatic extraction of relevant information from clinical text data, which are acquired every day in hospitals. In 2018, the language model Bidirectional Encoder Representations from Transformers (BERT) was introduced, generating new state-of-the-a...
Main authors: | Kades, Klaus; Sellner, Jan; Koehler, Gregor; Full, Peter M; Lai, T Y Emmy; Kleesiek, Jens; Maier-Hein, Klaus H |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7889424/ https://www.ncbi.nlm.nih.gov/pubmed/33533728 http://dx.doi.org/10.2196/22795 |
_version_ | 1783652306883969024 |
---|---|
author | Kades, Klaus; Sellner, Jan; Koehler, Gregor; Full, Peter M; Lai, T Y Emmy; Kleesiek, Jens; Maier-Hein, Klaus H |
author_sort | Kades, Klaus |
collection | PubMed |
description | BACKGROUND: Natural Language Understanding enables automatic extraction of relevant information from clinical text data, which are acquired every day in hospitals. In 2018, the language model Bidirectional Encoder Representations from Transformers (BERT) was introduced, generating new state-of-the-art results on several downstream tasks. The National NLP Clinical Challenges (n2c2) is an initiative that strives to tackle such downstream tasks on domain-specific clinical data. In this paper, we present the results of our participation in the 2019 n2c2 and related work completed thereafter. OBJECTIVE: The objective of this study was to optimally leverage BERT for the task of assessing the semantic textual similarity of clinical text data. METHODS: We used BERT as an initial baseline and analyzed the results, which we used as a starting point to develop 3 different approaches where we (1) added handcrafted sentence similarity features to the classifier token of BERT and combined the results with more features in multiple regression estimators, (2) incorporated a built-in ensembling method, M-Heads, into BERT by duplicating the regression head and applying an adapted training strategy to facilitate the focus of the heads on different input patterns of the medical sentences, and (3) developed a graph-based similarity approach for medications, which allows extrapolating similarities across known entities from the training set. The approaches were evaluated with the Pearson correlation coefficient between the predicted scores and ground truth of the official training and test datasets. RESULTS: We improved the performance of BERT on the test dataset from a Pearson correlation coefficient of 0.859 to 0.883 using a combination of the M-Heads method and the graph-based similarity approach. We also show differences between the test and training datasets and how the two datasets influenced the results. CONCLUSIONS: We found that using a graph-based similarity approach has the potential to extrapolate domain-specific knowledge to unseen sentences. We observed that it is easily possible to obtain deceptive results from the test dataset, especially when the distribution of the data samples is different between training and test datasets. |
format | Online Article Text |
id | pubmed-7889424 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-7889424 2021-03-05 Adapting Bidirectional Encoder Representations from Transformers (BERT) to Assess Clinical Semantic Textual Similarity: Algorithm Development and Validation Study Kades, Klaus; Sellner, Jan; Koehler, Gregor; Full, Peter M; Lai, T Y Emmy; Kleesiek, Jens; Maier-Hein, Klaus H JMIR Med Inform Original Paper JMIR Publications 2021-02-03 /pmc/articles/PMC7889424/ /pubmed/33533728 http://dx.doi.org/10.2196/22795 Text en ©Klaus Kades, Jan Sellner, Gregor Koehler, Peter M Full, T Y Emmy Lai, Jens Kleesiek, Klaus H Maier-Hein. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 03.02.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included. |
title | Adapting Bidirectional Encoder Representations from Transformers (BERT) to Assess Clinical Semantic Textual Similarity: Algorithm Development and Validation Study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7889424/ https://www.ncbi.nlm.nih.gov/pubmed/33533728 http://dx.doi.org/10.2196/22795 |
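
The abstract in this record states that the approaches were evaluated with the Pearson correlation coefficient between predicted scores and ground truth. As a minimal sketch of how such a metric can be computed (not the authors' evaluation code), the function below uses only the Python standard library. The function name `pearson` and the sample scores are hypothetical and chosen purely for illustration.

```python
import math


def pearson(predicted, truth):
    """Pearson correlation coefficient between two equal-length score lists.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(predicted) != len(truth) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(truth) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


# Made-up similarity scores for five sentence pairs (not data from the study).
model_scores = [4.1, 0.8, 2.9, 3.6, 1.2]
gold_scores = [4.5, 1.0, 2.5, 4.0, 0.5]
print(round(pearson(model_scores, gold_scores), 3))
```

Because the coefficient measures only linear association, a model whose scores are consistently shifted or rescaled relative to the ground truth can still reach a high value, so the metric says nothing about absolute score error.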