Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT
Main authors: | Mutinda, Faith Wavinya; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Georg Thieme Verlag KG, 2021 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8294940/ https://www.ncbi.nlm.nih.gov/pubmed/34237783 http://dx.doi.org/10.1055/s-0041-1731390 |
_version_ | 1783725335449174016 |
---|---|
author | Mutinda, Faith Wavinya; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji |
author_facet | Mutinda, Faith Wavinya; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji |
author_sort | Mutinda, Faith Wavinya |
collection | PubMed |
description | Background Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, few resources exist for STS tasks in the clinical domain and in languages other than English, such as Japanese. Objective The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a publicly available Japanese dataset. Materials We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database; the EMR dataset was created from Japanese electronic medical records. Methods We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular approach to transfer learning and has proven effective in achieving high accuracy on small datasets. We implemented two pretrained Japanese BERT models: a general Japanese BERT, pretrained on Japanese Wikipedia texts, and a clinical Japanese BERT, pretrained on Japanese clinical texts. Results The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT and achieved a high correlation with human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical domain datasets. This could be because the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT. |
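The reported results (0.904 and 0.875) are correlations between model similarity scores and human ratings. As an illustration only, a minimal sketch of a Pearson correlation computation over score pairs; the score values below are invented for demonstration and are not from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: human similarity ratings (e.g., a 0-5 scale)
# versus model-predicted similarity scores for five sentence pairs.
human = [0.0, 1.5, 3.0, 4.5, 5.0]
model = [0.2, 1.4, 2.8, 4.6, 4.9]
r = pearson(human, model)  # approaches 1.0 when the model tracks human judgments
```

A value of `r` near 1.0, as in the paper's reported 0.904 and 0.875, indicates that the model's similarity scores rise and fall with the human ratings.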
format | Online Article Text |
id | pubmed-8294940 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Georg Thieme Verlag KG |
record_format | MEDLINE/PubMed |
spelling | pubmed-8294940 2021-07-23 Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT Mutinda, Faith Wavinya; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji. Methods Inf Med. Georg Thieme Verlag KG 2021-06 2021-07-08 /pmc/articles/PMC8294940/ /pubmed/34237783 http://dx.doi.org/10.1055/s-0041-1731390 Text en The Author(s). This is an open-access article published by Thieme under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License ( https://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits copying and reproduction for non-commercial purposes only, provided the original work is given appropriate credit; contents may not be used for commercial purposes, or adapted, remixed, transformed, or built upon for distribution. |
spellingShingle | Mutinda, Faith Wavinya; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji. Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title | Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title_full | Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title_fullStr | Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title_full_unstemmed | Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title_short | Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT |
title_sort | semantic textual similarity in japanese clinical domain texts using bert |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8294940/ https://www.ncbi.nlm.nih.gov/pubmed/34237783 http://dx.doi.org/10.1055/s-0041-1731390 |
work_keys_str_mv | AT mutindafaithwavinya semantictextualsimilarityinjapaneseclinicaldomaintextsusingbert AT yadashuntaro semantictextualsimilarityinjapaneseclinicaldomaintextsusingbert AT wakamiyashoko semantictextualsimilarityinjapaneseclinicaldomaintextsusingbert AT aramakieiji semantictextualsimilarityinjapaneseclinicaldomaintextsusingbert |