Neural sentence embedding models for semantic similarity estimation in the biomedical domain
BACKGROUND: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While curr...
Main authors: | Blagec, Kathrin; Xu, Hong; Agibetov, Asan; Samwald, Matthias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6460644/ https://www.ncbi.nlm.nih.gov/pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2 |
_version_ | 1783410358959996928 |
---|---|
author | Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias |
author_facet | Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias |
author_sort | Blagec, Kathrin |
collection | PubMed |
description | BACKGROUND: Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. RESULTS: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor. CONCLUSIONS: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2789-2) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6460644 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6460644 2019-04-22 Neural sentence embedding models for semantic similarity estimation in the biomedical domain Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias BMC Bioinformatics Research Article BioMed Central 2019-04-11 /pmc/articles/PMC6460644/ /pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Blagec, Kathrin Xu, Hong Agibetov, Asan Samwald, Matthias Neural sentence embedding models for semantic similarity estimation in the biomedical domain |
title | Neural sentence embedding models for semantic similarity estimation in the biomedical domain |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6460644/ https://www.ncbi.nlm.nih.gov/pubmed/30975071 http://dx.doi.org/10.1186/s12859-019-2789-2 |
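
The evaluation summarized in the description (model similarity scores compared against human annotations by Pearson correlation) can be illustrated with a minimal sketch. This is not the authors' code: the toy embedding vectors, the human scores, and the function names `cosine_similarity` and `pearson_r` are all assumptions made for illustration, and cosine similarity is assumed here as the way to compare two sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: each entry is (embedding of sentence A, embedding of
# sentence B, human similarity score). Real embeddings would come from a
# trained sentence embedding model and have many more dimensions.
toy_pairs = [
    ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1], 3.5),
    ([0.1, 0.9, 0.2], [0.7, 0.1, 0.6], 1.0),
    ([0.3, 0.3, 0.9], [0.2, 0.4, 0.8], 3.0),
    ([1.0, 0.0, 0.0], [0.0, 0.1, 1.0], 0.5),
]

model_scores = [cosine_similarity(a, b) for a, b, _ in toy_pairs]
human_scores = [score for _, _, score in toy_pairs]

# Agreement between the model's similarity estimates and the human ratings.
print(f"Pearson r = {pearson_r(model_scores, human_scores):.3f}")
```

On a real benchmark, `toy_pairs` would be replaced by the annotated sentence pairs and the embeddings by the output of the model under evaluation; the correlation computation itself stays the same.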