
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Bibliographic Details
Main Authors: Zhu, Yongjun; Yan, Erjia; Wang, Fei
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5496182/
https://www.ncbi.nlm.nih.gov/pubmed/28673289
http://dx.doi.org/10.1186/s12911-017-0498-1
_version_ 1783247921043472384
author Zhu, Yongjun
Yan, Erjia
Wang, Fei
author_facet Zhu, Yongjun
Yan, Erjia
Wang, Fei
author_sort Zhu, Yongjun
collection PubMed
description BACKGROUND: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec’s ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec. METHODS: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine the effects of recency, size, and section. RESULTS: Models trained on more recent datasets did not boost performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 pairs at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results with higher correlations to the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (344 pairs for abstracts vs. 498 for article bodies in the similarity task, and 339 vs. 503 in the relatedness task). CONCLUSIONS: Increasing the size of the dataset does not always enhance performance. Larger datasets can identify more relations between biomedical terms, even though they do not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations.
format Online
Article
Text
id pubmed-5496182
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5496182 2017-07-05 Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec Zhu, Yongjun Yan, Erjia Wang, Fei BMC Med Inform Decis Mak Research Article BACKGROUND: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec’s ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec. METHODS: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine the effects of recency, size, and section. RESULTS: Models trained on more recent datasets did not boost performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 pairs at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results with higher correlations to the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (344 pairs for abstracts vs. 498 for article bodies in the similarity task, and 339 vs. 503 in the relatedness task). CONCLUSIONS: Increasing the size of the dataset does not always enhance performance. Larger datasets can identify more relations between biomedical terms, even though they do not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations. BioMed Central 2017-07-03 /pmc/articles/PMC5496182/ /pubmed/28673289 http://dx.doi.org/10.1186/s12911-017-0498-1 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Zhu, Yongjun
Yan, Erjia
Wang, Fei
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title_full Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title_fullStr Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title_full_unstemmed Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title_short Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
title_sort semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5496182/
https://www.ncbi.nlm.nih.gov/pubmed/28673289
http://dx.doi.org/10.1186/s12911-017-0498-1
work_keys_str_mv AT zhuyongjun semanticrelatednessandsimilarityofbiomedicaltermsexaminingtheeffectsofrecencysizeandsectionofbiomedicalpublicationsontheperformanceofword2vec
AT yanerjia semanticrelatednessandsimilarityofbiomedicaltermsexaminingtheeffectsofrecencysizeandsectionofbiomedicalpublicationsontheperformanceofword2vec
AT wangfei semanticrelatednessandsimilarityofbiomedicaltermsexaminingtheeffectsofrecencysizeandsectionofbiomedicalpublicationsontheperformanceofword2vec
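
The METHODS paragraph in the description above amounts to a three-step pipeline: train word2vec on tokenized PubMed text, score each reference term pair by the cosine similarity of its word vectors, and correlate the model scores with the human ratings while counting how many pairs the model can actually score. The sketch below illustrates that pipeline in Python using gensim and scipy; the file names, hyperparameters, rank-correlation choice, and the load_reference_pairs helper are illustrative assumptions, not details taken from the paper.

# Minimal sketch of the evaluation pipeline described in METHODS; not the authors' code.
# Assumptions (not from the paper): the tokenized corpus file, the TSV reference-standard
# format, the hyperparameter values, and the load_reference_pairs helper are illustrative.
from gensim.models import Word2Vec
from scipy.stats import spearmanr


def load_reference_pairs(path):
    """Hypothetical loader: each line is 'term1<TAB>term2<TAB>human_score'."""
    pairs = []
    with open(path) as f:
        for line in f:
            term1, term2, score = line.rstrip("\n").split("\t")
            pairs.append((term1.lower(), term2.lower(), float(score)))
    return pairs


# One tokenized, lower-cased sentence per line (e.g., preprocessed PubMed abstracts).
sentences = [line.split() for line in open("pubmed_abstracts_tokenized.txt")]

# Skip-gram word2vec; gensim >= 4.0 uses vector_size (older releases call it size).
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=1, workers=4)

model_scores, human_scores = [], []
for term1, term2, human in load_reference_pairs("reference_standard.tsv"):
    # Only pairs whose terms both occur in the training vocabulary can be scored;
    # this is the pair coverage counted in the RESULTS above.
    if term1 in model.wv and term2 in model.wv:
        model_scores.append(model.wv.similarity(term1, term2))  # cosine similarity
        human_scores.append(human)

rho, _ = spearmanr(model_scores, human_scores)
print(f"covered pairs: {len(model_scores)}, rank correlation with reference: {rho:.2f}")

Running this once per subset (by recency, size decile, or abstract vs. body) and comparing the covered-pair counts and correlations reproduces, in outline, the kind of comparison reported in the RESULTS field.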