Discovering gene annotations in biomedical text databases

BACKGROUND: Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearl...

Bibliographic Details
Main Authors: Cakmak, Ali, Ozsoyoglu, Gultekin
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2335285/
https://www.ncbi.nlm.nih.gov/pubmed/18325104
http://dx.doi.org/10.1186/1471-2105-9-143
_version_ 1782152816627810304
author Cakmak, Ali
Ozsoyoglu, Gultekin
author_facet Cakmak, Ali
Ozsoyoglu, Gultekin
author_sort Cakmak, Ali
collection PubMed
description BACKGROUND: Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. RESULTS: In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. CONCLUSION: GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision.
The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be improved either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data or, alternatively, by lowering the statistical enrichment threshold for applications that place more value on higher recall.
format Text
id pubmed-2335285
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2335285 2008-04-28 Discovering gene annotations in biomedical text databases Cakmak, Ali Ozsoyoglu, Gultekin BMC Bioinformatics Research Article BACKGROUND: Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. RESULTS: In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. CONCLUSION: GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision.
The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be improved either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data or, alternatively, by lowering the statistical enrichment threshold for applications that place more value on higher recall. BioMed Central 2008-03-06 /pmc/articles/PMC2335285/ /pubmed/18325104 http://dx.doi.org/10.1186/1471-2105-9-143 Text en Copyright © 2008 Cakmak and Ozsoyoglu; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Cakmak, Ali
Ozsoyoglu, Gultekin
Discovering gene annotations in biomedical text databases
title Discovering gene annotations in biomedical text databases
title_full Discovering gene annotations in biomedical text databases
title_fullStr Discovering gene annotations in biomedical text databases
title_full_unstemmed Discovering gene annotations in biomedical text databases
title_short Discovering gene annotations in biomedical text databases
title_sort discovering gene annotations in biomedical text databases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2335285/
https://www.ncbi.nlm.nih.gov/pubmed/18325104
http://dx.doi.org/10.1186/1471-2105-9-143
work_keys_str_mv AT cakmakali discoveringgeneannotationsinbiomedicaltextdatabases
AT ozsoyoglugultekin discoveringgeneannotationsinbiomedicaltextdatabases