Neural sentence embedding models for semantic similarity estimation in the biomedical domain

BACKGROUND: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While curr...


Bibliographic Details
Main Authors:	Blagec, Kathrin; Xu, Hong; Agibetov, Asan; Samwald, Matthias
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6460644/ https://www.ncbi.nlm.nih.gov/pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2

_version_	1783410358959996928
author	Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias
author_facet	Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias
author_sort	Blagec, Kathrin
collection	PubMed
description	BACKGROUND: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. RESULTS: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor. 
CONCLUSIONS: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2789-2) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6460644
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-64606442019-04-22 Neural sentence embedding models for semantic similarity estimation in the biomedical domain Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias BMC Bioinformatics Research Article BACKGROUND: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. RESULTS: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor. 
CONCLUSIONS: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2789-2) contains supplementary material, which is available to authorized users. BioMed Central 2019-04-11 /pmc/articles/PMC6460644/ /pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title_full	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title_fullStr	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title_full_unstemmed	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title_short	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
title_sort	neural sentence embedding models for semantic similarity estimation in the biomedical domain
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6460644/ https://www.ncbi.nlm.nih.gov/pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2
work_keys_str_mv	AT blageckathrin neuralsentenceembeddingmodelsforsemanticsimilarityestimationinthebiomedicaldomain AT xuhong neuralsentenceembeddingmodelsforsemanticsimilarityestimationinthebiomedicaldomain AT agibetovasan neuralsentenceembeddingmodelsforsemanticsimilarityestimationinthebiomedicaldomain AT samwaldmatthias neuralsentenceembeddingmodelsforsemanticsimilarityestimationinthebiomedicaldomain
