Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy

BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most t...

Bibliographic Details
Main authors: Alexopoulou, Dimitra, Andreopoulos, Bill, Dietze, Heiko, Doms, Andreas, Gandon, Fabien, Hakenberg, Jörg, Khelif, Khaled, Schroeder, Michael, Wächter, Thomas
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2663782/
https://www.ncbi.nlm.nih.gov/pubmed/19159460
http://dx.doi.org/10.1186/1471-2105-10-28
collection PubMed
description BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. RESULTS: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve an 80% success rate on average. The 'MetaData' approach performed best, with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average. CONCLUSION: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but does require a large, consistently modelled ontology; these are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation. AVAILABILITY: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
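The 'Closest Sense' idea described in the abstract (pick the sense with the shortest ontology path to the terms co-occurring in the document) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy ontology, the term labels, the use of undirected breadth-first search, and the choice of the mean path length as the score are all assumptions made for illustration.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency map from (term, term) ontology edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Number of edges on a shortest path (breadth-first search), or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def closest_sense(graph, senses, context_terms):
    """Return the sense with the smallest mean path length to the context.

    Assumption: scoring by the mean distance is one plausible choice;
    the abstract does not specify how distances are aggregated.
    """
    best, best_score = None, float("inf")
    for sense in senses:
        dists = [path_length(graph, sense, term) for term in context_terms]
        dists = [d for d in dists if d is not None]
        if not dists:
            continue  # no context term is reachable from this sense
        score = sum(dists) / len(dists)
        if score < best_score:
            best, best_score = sense, score
    return best

# Hypothetical toy ontology with two senses of the label "nucleus".
EDGES = [
    ("cell nucleus", "organelle"),
    ("cell nucleus", "chromatin"),
    ("organelle", "mitochondrion"),
    ("organelle", "anatomical entity"),
    ("brain nucleus", "brain region"),
    ("brain region", "thalamus"),
    ("brain region", "anatomical entity"),
]
GRAPH = build_graph(EDGES)
SENSES = ["cell nucleus", "brain nucleus"]

print(closest_sense(GRAPH, SENSES, ["chromatin", "mitochondrion"]))
print(closest_sense(GRAPH, SENSES, ["thalamus"]))
```

In this toy graph a document mentioning "chromatin" and "mitochondrion" resolves to the cell sense, while one mentioning "thalamus" resolves to the brain sense. The sketch also hints at why the abstract says the method needs a large, consistently modelled ontology: path lengths are only meaningful when the edges are modelled uniformly.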
format Text
id pubmed-2663782
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2663782 2009-04-02 Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy Alexopoulou, Dimitra Andreopoulos, Bill Dietze, Heiko Doms, Andreas Gandon, Fabien Hakenberg, Jörg Khelif, Khaled Schroeder, Michael Wächter, Thomas BMC Bioinformatics Research Article BioMed Central 2009-01-21 /pmc/articles/PMC2663782/ /pubmed/19159460 http://dx.doi.org/10.1186/1471-2105-10-28 Text en Copyright © 2009 Alexopoulou et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy
topic Research Article