
Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature

BACKGROUND: Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational f...


Bibliographic Details
Main authors: Arguello Casteleiro, Mercedes, Demetriou, George, Read, Warren, Fernandez Prieto, Maria Jesus, Maroto, Nava, Maseda Fernandez, Diego, Nenadic, Goran, Klein, Julie, Keane, John, Stevens, Robert
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5896136/
https://www.ncbi.nlm.nih.gov/pubmed/29650041
http://dx.doi.org/10.1186/s13326-018-0181-1
_version_ 1783313783658119168
author Arguello Casteleiro, Mercedes
Demetriou, George
Read, Warren
Fernandez Prieto, Maria Jesus
Maroto, Nava
Maseda Fernandez, Diego
Nenadic, Goran
Klein, Julie
Keane, John
Stevens, Robert
author_sort Arguello Casteleiro, Mercedes
collection PubMed
description BACKGROUND: Automatic identification of term variants, or acceptable alternative free-text terms, for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) whether word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) whether biological knowledge from the CVDO can improve such a list without modifying the word embeddings created.
METHODS: We manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 million PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We set up two experiments for a synonym detection task, each with four raters, and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts. Experiment II involves 63 UniProtKB entries, and its target terms combine terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels.
RESULTS: In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%.
CONCLUSIONS: This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task when knowledge formalised in the CVDO is added without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13326-018-0181-1) contains supplementary material, which is available to authorized users.
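The synonym detection setup described in the abstract (a target term, optionally extended with context terms from the ontology, paired with candidate terms drawn from the word embeddings) can be illustrated with a minimal sketch. It rests on assumptions that this record does not confirm: candidates are ranked by cosine similarity, and a multi-term target is represented by averaging its term vectors. The vocabulary, the 3-dimensional vectors, and the function names `cosine`, `query_vector`, and `candidate_terms` are all hypothetical; real vectors would come from a CBOW or Skip-gram model trained on PubMed titles and abstracts.

```python
import math

# Hypothetical toy embeddings (3 dimensions), for illustration only.
TOY_EMBEDDINGS = {
    "il6": [0.9, 0.1, 0.0],
    "il-6": [0.85, 0.15, 0.05],
    "interleukin-6": [0.8, 0.2, 0.1],
    "cytokine": [0.5, 0.5, 0.2],
    "aspirin": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_vector(terms, embeddings):
    """Average the vectors of the target terms (an assumed composition rule)."""
    vectors = [embeddings[t] for t in terms if t in embeddings]
    if not vectors:
        raise ValueError("none of the target terms is in the vocabulary")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def candidate_terms(target_terms, embeddings, top_n=3):
    """Return the top_n (term, similarity) pairs closest to the target terms."""
    query = query_vector(target_terms, embeddings)
    scored = [
        (term, cosine(query, vec))
        for term, vec in embeddings.items()
        if term not in target_terms  # a target term is not its own candidate
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Target term as it might appear in a title or abstract (Experiment I style).
literature_only = candidate_terms(["il6"], TOY_EMBEDDINGS, top_n=2)

# Same target plus one added context term (Experiment II style); the
# embeddings themselves are left unchanged, only the query differs.
with_context = candidate_terms(["il6", "cytokine"], TOY_EMBEDDINGS, top_n=2)
```

In this toy data, adding the context term changes the ranking of the candidates while the embeddings stay untouched, which mirrors the second research question in the abstract. Human raters would then judge each target/candidate pair.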
format Online
Article
Text
id pubmed-5896136
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5896136 2018-04-20
Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature
Arguello Casteleiro, Mercedes; Demetriou, George; Read, Warren; Fernandez Prieto, Maria Jesus; Maroto, Nava; Maseda Fernandez, Diego; Nenadic, Goran; Klein, Julie; Keane, John; Stevens, Robert
J Biomed Semantics
Research
BioMed Central 2018-04-12
/pmc/articles/PMC5896136/
/pubmed/29650041
http://dx.doi.org/10.1186/s13326-018-0181-1
Text en
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature
topic Research