A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation

Bibliographic Details
Main Authors: Yum, Yunjin, Lee, Jeong Moon, Jang, Moon Joung, Kim, Yoojoong, Kim, Jong-Ho, Kim, Seongtae, Shin, Unsub, Song, Sanghoun, Joo, Hyung Joon
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8277378/
https://www.ncbi.nlm.nih.gov/pubmed/34185005
http://dx.doi.org/10.2196/29667
_version_ 1783722061584138240
author Yum, Yunjin
Lee, Jeong Moon
Jang, Moon Joung
Kim, Yoojoong
Kim, Jong-Ho
Kim, Seongtae
Shin, Unsub
Song, Sanghoun
Joo, Hyung Joon
author_facet Yum, Yunjin
Lee, Jeong Moon
Jang, Moon Joung
Kim, Yoojoong
Kim, Jong-Ho
Kim, Seongtae
Shin, Unsub
Song, Sanghoun
Joo, Hyung Joon
author_sort Yum, Yunjin
collection PubMed
description BACKGROUND: Because medical terms require special expertise and are becoming increasingly complex, it is difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, there is a need to develop an updated standard to represent recent findings in medical sciences. OBJECTIVE: We propose a new Korean word pair reference set to verify embedding models. METHODS: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs. RESULTS: The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pairs showed a high correlation (ρ=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by fewer than 50% of all the respondents. When FastText models were applied to the final reference standard word pair sets, the embedding models trained on medical documents showed a higher correlation between the calculated cosine similarity scores and the human-judged similarity and relatedness scores than the namu model did (similarity task: namu, ρ=0.12 vs with medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs with medical text, ρ=0.30). CONCLUSIONS: Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future.
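The evaluation summarized in the description (cosine similarity between word vectors, correlated against human-judged scores) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that ρ denotes Spearman's rank correlation, and the function names, the toy vectors, and the placeholder words `w1` to `w4` are all hypothetical. In practice the vectors would come from a trained embedding model such as FastText.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, scored_pairs):
    """Correlate model cosine scores with human scores.

    embeddings: dict mapping a word to its vector (hypothetical layout).
    scored_pairs: iterable of (word1, word2, human_score).
    Pairs with a word missing from the embeddings are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, human in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine_similarity(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    return spearman_rho(model_scores, human_scores)


# Toy data only; these are not values from the reference standard.
toy_embeddings = {
    "w1": [1.0, 0.0],
    "w2": [1.0, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
toy_pairs = [("w1", "w2", 3.8), ("w1", "w4", 2.0), ("w1", "w3", 0.5)]
rho = evaluate(toy_embeddings, toy_pairs)
```

Skipping pairs with out-of-vocabulary words is one possible design choice; a subword-based model such as FastText can usually produce a vector for unseen words, so that branch may rarely trigger in that setting.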
format Online
Article
Text
id pubmed-8277378
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-8277378 2021-07-26 A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation Yum, Yunjin Lee, Jeong Moon Jang, Moon Joung Kim, Yoojoong Kim, Jong-Ho Kim, Seongtae Shin, Unsub Song, Sanghoun Joo, Hyung Joon JMIR Med Inform Original Paper BACKGROUND: Because medical terms require special expertise and are becoming increasingly complex, it is difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, there is a need to develop an updated standard to represent recent findings in medical sciences. OBJECTIVE: We propose a new Korean word pair reference set to verify embedding models. METHODS: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs. RESULTS: The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pairs showed a high correlation (ρ=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by fewer than 50% of all the respondents. When FastText models were applied to the final reference standard word pair sets, the embedding models trained on medical documents showed a higher correlation between the calculated cosine similarity scores and the human-judged similarity and relatedness scores than the namu model did (similarity task: namu, ρ=0.12 vs with medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs with medical text, ρ=0.30). CONCLUSIONS: Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future. JMIR Publications 2021-06-24 /pmc/articles/PMC8277378/ /pubmed/34185005 http://dx.doi.org/10.2196/29667 Text en ©Yunjin Yum, Jeong Moon Lee, Moon Joung Jang, Yoojoong Kim, Jong-Ho Kim, Seongtae Kim, Unsub Shin, Sanghoun Song, Hyung Joon Joo. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 24.06.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Yum, Yunjin
Lee, Jeong Moon
Jang, Moon Joung
Kim, Yoojoong
Kim, Jong-Ho
Kim, Seongtae
Shin, Unsub
Song, Sanghoun
Joo, Hyung Joon
A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title_full A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title_fullStr A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title_full_unstemmed A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title_short A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation
title_sort word pair dataset for semantic similarity and relatedness in korean medical vocabulary: reference development and validation
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8277378/
https://www.ncbi.nlm.nih.gov/pubmed/34185005
http://dx.doi.org/10.2196/29667