Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models
BACKGROUND: Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS.
Main Authors: Yang, Xi; He, Xing; Zhang, Hansi; Ma, Yinghan; Bian, Jiang; Wu, Yonghui
Format: Online Article Text
Language: English
Published: JMIR Publications, 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7721552/ https://www.ncbi.nlm.nih.gov/pubmed/33226350 http://dx.doi.org/10.2196/19735
_version_ | 1783620045785530368 |
author | Yang, Xi He, Xing Zhang, Hansi Ma, Yinghan Bian, Jiang Wu, Yonghui |
author_facet | Yang, Xi He, Xing Zhang, Hansi Ma, Yinghan Bian, Jiang Wu, Yonghui |
author_sort | Yang, Xi |
collection | PubMed |
description | BACKGROUND: Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. OBJECTIVE: This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. METHODS: In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. RESULTS: Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). CONCLUSIONS: This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization. |
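The abstract evaluates systems by the Pearson correlation between model-predicted similarity scores and gold similarity labels. As a minimal illustration of that metric (the `pearson` helper and the score values below are hypothetical, not from the paper), a pure-Python sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical gold similarity labels (0-5 scale, as in typical STS tasks)
# and hypothetical model predictions for five sentence pairs.
gold = [0.0, 1.5, 2.0, 3.5, 5.0]
pred = [0.2, 1.0, 2.4, 3.3, 4.8]
score = pearson(gold, pred)
```

A higher value (up to 1.0) means the model's ranking and scaling of sentence-pair similarities tracks the human judgments more closely; the challenge's top systems reached roughly 0.89–0.91 on this metric.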
format | Online Article Text |
id | pubmed-7721552 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-77215522020-12-11 Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models Yang, Xi He, Xing Zhang, Hansi Ma, Yinghan Bian, Jiang Wu, Yonghui JMIR Med Inform Original Paper BACKGROUND: Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. OBJECTIVE: This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. METHODS: In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. RESULTS: Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). 
CONCLUSIONS: This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization. JMIR Publications 2020-11-23 /pmc/articles/PMC7721552/ /pubmed/33226350 http://dx.doi.org/10.2196/19735 Text en ©Xi Yang, Xing He, Hansi Zhang, Yinghan Ma, Jiang Bian, Yonghui Wu. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.11.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included. |
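The record mentions that the authors "investigated various ensemble methods to combine different transformer models" without specifying them here. One common baseline ensemble is simply averaging each sentence pair's predicted score across models; a sketch under that assumption (all function names and score values below are hypothetical):

```python
def ensemble_mean(per_model_scores):
    """Average each sentence pair's predicted score across several models."""
    return [sum(scores) / len(scores) for scores in zip(*per_model_scores)]

# Hypothetical similarity predictions from three fine-tuned models
# (e.g. BERT, XLNet, RoBERTa) on the same four sentence pairs.
bert_scores    = [1.0, 3.2, 4.0, 0.5]
xlnet_scores   = [1.4, 3.0, 4.4, 0.3]
roberta_scores = [1.2, 3.4, 4.2, 0.4]
combined = ensemble_mean([bert_scores, xlnet_scores, roberta_scores])
```

Averaging tends to cancel out uncorrelated per-model errors, which is one plausible reason ensembles can beat any single transformer on correlation-based metrics; weighted averages or stacked regressors are other options the phrase "various ensemble methods" may cover.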
spellingShingle | Original Paper Yang, Xi He, Xing Zhang, Hansi Ma, Yinghan Bian, Jiang Wu, Yonghui Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title | Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title_full | Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title_fullStr | Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title_full_unstemmed | Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title_short | Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models |
title_sort | measurement of semantic textual similarity in clinical texts: comparison of transformer-based models |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7721552/ https://www.ncbi.nlm.nih.gov/pubmed/33226350 http://dx.doi.org/10.2196/19735 |
work_keys_str_mv | AT yangxi measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels AT hexing measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels AT zhanghansi measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels AT mayinghan measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels AT bianjiang measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels AT wuyonghui measurementofsemantictextualsimilarityinclinicaltextscomparisonoftransformerbasedmodels |