Neural sentence embedding models for semantic similarity estimation in the biomedical domain

BACKGROUND: Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While curr...

Bibliographic Details
Main Authors: Blagec, Kathrin; Xu, Hong; Agibetov, Asan; Samwald, Matthias
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6460644/
https://www.ncbi.nlm.nih.gov/pubmed/30975071
http://dx.doi.org/10.1186/s12859-019-2789-2
author Blagec, Kathrin
Xu, Hong
Agibetov, Asan
Samwald, Matthias
collection PubMed
description BACKGROUND: Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set.
RESULTS: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor.
CONCLUSIONS: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2789-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6460644
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6460644 2019-04-22 Neural sentence embedding models for semantic similarity estimation in the biomedical domain Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias BMC Bioinformatics Research Article BioMed Central 2019-04-11 /pmc/articles/PMC6460644/ /pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Neural sentence embedding models for semantic similarity estimation in the biomedical domain
topic Research Article
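
The abstract in this record reports model quality as a Pearson correlation between model-estimated and expert-annotated sentence similarity. The following is a minimal sketch of that kind of evaluation, not the authors' implementation. It assumes sentence embeddings are compared with cosine similarity, which the record does not state. The function names, the toy vectors, and the human scores are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors; an assumption of this sketch
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def evaluate(embedding_pairs, human_scores):
    """Correlate model similarities with human annotations for a list of sentence pairs."""
    model_scores = [cosine_similarity(u, v) for u, v in embedding_pairs]
    return pearson_r(model_scores, human_scores)


# Hypothetical toy data: three sentence pairs as 2-d embeddings, with made-up
# human similarity ratings. These are NOT values from the benchmark set.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [1.0, 1.0]),  # partially similar
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_human_scores = [4.0, 2.5, 0.5]

print(f"Pearson r on toy data: {evaluate(toy_pairs, toy_human_scores):.3f}")
```

In a real setting the toy vectors would be replaced by embeddings inferred from a trained sentence embedding model, and the toy scores by the expert annotations of the benchmark set.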