Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study
BACKGROUND: Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work d...
Main Authors: | Chen, Qingyu; Rankine, Alex; Peng, Yifan; Aghaarabi, Elaheh; Lu, Zhiyong |
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications, 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8759018/ https://www.ncbi.nlm.nih.gov/pubmed/34967748 http://dx.doi.org/10.2196/27386 |
_version_ | 1784633026327609344 |
author | Chen, Qingyu Rankine, Alex Peng, Yifan Aghaarabi, Elaheh Lu, Zhiyong |
author_facet | Chen, Qingyu Rankine, Alex Peng, Yifan Aghaarabi, Elaheh Lu, Zhiyong |
author_sort | Chen, Qingyu |
collection | PubMed |
description | BACKGROUND: Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. OBJECTIVE: Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. METHODS: We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. RESULTS: Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. CONCLUSIONS: Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness. |
format | Online Article Text |
id | pubmed-8759018 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-87590182022-02-03 Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study Chen, Qingyu Rankine, Alex Peng, Yifan Aghaarabi, Elaheh Lu, Zhiyong JMIR Med Inform Original Paper BACKGROUND: Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. OBJECTIVE: Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. METHODS: We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. RESULTS: Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. CONCLUSIONS: Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness. 
JMIR Publications 2021-12-30 /pmc/articles/PMC8759018/ /pubmed/34967748 http://dx.doi.org/10.2196/27386 Text en ©Qingyu Chen, Alex Rankine, Yifan Peng, Elaheh Aghaarabi, Zhiyong Lu. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 30.12.2021. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Chen, Qingyu Rankine, Alex Peng, Yifan Aghaarabi, Elaheh Lu, Zhiyong Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title | Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title_full | Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title_fullStr | Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title_full_unstemmed | Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title_short | Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study |
title_sort | benchmarking effectiveness and efficiency of deep learning models for semantic textual similarity in the clinical domain: validation study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8759018/ https://www.ncbi.nlm.nih.gov/pubmed/34967748 http://dx.doi.org/10.2196/27386 |
work_keys_str_mv | AT chenqingyu benchmarkingeffectivenessandefficiencyofdeeplearningmodelsforsemantictextualsimilarityintheclinicaldomainvalidationstudy AT rankinealex benchmarkingeffectivenessandefficiencyofdeeplearningmodelsforsemantictextualsimilarityintheclinicaldomainvalidationstudy AT pengyifan benchmarkingeffectivenessandefficiencyofdeeplearningmodelsforsemantictextualsimilarityintheclinicaldomainvalidationstudy AT aghaarabielaheh benchmarkingeffectivenessandefficiencyofdeeplearningmodelsforsemantictextualsimilarityintheclinicaldomainvalidationstudy AT luzhiyong benchmarkingeffectivenessandefficiencyofdeeplearningmodelsforsemantictextualsimilarityintheclinicaldomainvalidationstudy |
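The METHODS and RESULTS portions of the abstract above describe the benchmark's evaluation protocol: each model is trained and tested 10 times, Pearson correlation serves as the official metric, Spearman correlation, R², and mean squared error are reported as additional measures, and a Wilcoxon rank-sum test compares results across runs. The sketch below shows one way to compute these measures with standard SciPy and scikit-learn calls; the variable names, toy scores, and per-run correlations are illustrative assumptions, not data from the study.

```python
# A minimal, illustrative sketch (not the authors' code): computing the
# effectiveness measures named in METHODS for one model's STS predictions.
import numpy as np
from scipy.stats import pearsonr, spearmanr, ranksums
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical gold-standard similarity scores (0-5 scale) and model predictions.
gold = np.array([0.0, 1.5, 2.0, 3.5, 4.0, 5.0])
pred = np.array([0.3, 1.2, 2.4, 3.1, 4.2, 4.8])

print("Pearson r:   ", pearsonr(gold, pred)[0])         # official task metric
print("Spearman rho:", spearmanr(gold, pred)[0])        # rank-based agreement
print("R^2:         ", r2_score(gold, pred))            # variance explained
print("MSE:         ", mean_squared_error(gold, pred))  # average squared error

# Comparing two models across repeated runs (the study repeats each experiment
# 10 times) with a Wilcoxon rank-sum test on the per-run Pearson correlations.
model_a_runs = np.array([0.850, 0.848, 0.851, 0.849, 0.847])  # hypothetical
model_b_runs = np.array([0.844, 0.846, 0.843, 0.845, 0.842])  # hypothetical
statistic, p_value = ranksums(model_a_runs, model_b_runs)
print("Wilcoxon rank-sum p-value:", p_value)
```

Because Pearson correlation rewards linear agreement regardless of absolute error, pairing it with mean squared error (as the study does) helps expose cases such as the high-MSE behavior reported for the BERT models on highly similar sentence pairs.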