Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition

BACKGROUND: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. T...

Bibliographic Details
Main Authors: Funk, Christopher S.; Cohen, K. Bretonnel; Hunter, Lawrence E.; Verspoor, Karin M.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018193/
https://www.ncbi.nlm.nih.gov/pubmed/27613112
http://dx.doi.org/10.1186/s13326-016-0096-7
author Funk, Christopher S.
Cohen, K. Bretonnel
Hunter, Lawrence E.
Verspoor, Karin M.
collection PubMed
description BACKGROUND: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from more simple terms. RESULTS: We present two different types of manually generated rules to help capture the variation of how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88 %) based on random sampling of annotations. CONCLUSIONS: In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of the Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13326-016-0096-7) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5018193
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5018193 2016-09-11 Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition Funk, Christopher S. Cohen, K. Bretonnel Hunter, Lawrence E. Verspoor, Karin M. J Biomed Semantics Research BioMed Central 2016-09-09 /pmc/articles/PMC5018193/ /pubmed/27613112 http://dx.doi.org/10.1186/s13326-016-0096-7 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018193/
https://www.ncbi.nlm.nih.gov/pubmed/27613112
http://dx.doi.org/10.1186/s13326-016-0096-7