
A cascaded approach to normalising gene mentions in biomedical literature

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of ge...


Bibliographic Details
Main authors: Yang, Hui; Nenadic, Goran; Keane, John A
Format: Text
Language: English
Published: Biomedical Informatics Publishing Group 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2241928/
https://www.ncbi.nlm.nih.gov/pubmed/18305829
_version_ 1782150558692409344
author Yang, Hui
Nenadic, Goran
Keane, John A
author_facet Yang, Hui
Nenadic, Goran
Keane, John A
author_sort Yang, Hui
collection PubMed
description Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%.
format Text
id pubmed-2241928
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher Biomedical Informatics Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-2241928 2008-02-27 A cascaded approach to normalising gene mentions in biomedical literature Yang, Hui Nenadic, Goran Keane, John A Bioinformation Hypothesis Biomedical Informatics Publishing Group 2007-12-30 /pmc/articles/PMC2241928/ /pubmed/18305829 Text en © 2007 Biomedical Informatics Publishing Group This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.
spellingShingle Hypothesis
Yang, Hui
Nenadic, Goran
Keane, John A
A cascaded approach to normalising gene mentions in biomedical literature
title A cascaded approach to normalising gene mentions in biomedical literature
title_full A cascaded approach to normalising gene mentions in biomedical literature
title_fullStr A cascaded approach to normalising gene mentions in biomedical literature
title_full_unstemmed A cascaded approach to normalising gene mentions in biomedical literature
title_short A cascaded approach to normalising gene mentions in biomedical literature
title_sort cascaded approach to normalising gene mentions in biomedical literature
topic Hypothesis
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2241928/
https://www.ncbi.nlm.nih.gov/pubmed/18305829
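
The abstract in this record describes a cascade that tries exact, then exact-like, then token-based approximate matching of gene mentions against a synonym dictionary. The following is a minimal sketch of that idea, not the authors' implementation: the toy dictionary, the identifiers (GENE:1, GENE:2), the function names, the normalisation rules, and the Jaccard threshold of 0.75 are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary: synonym -> gene identifier (illustrative values only).
SYNONYMS = {
    "tumor protein p53": "GENE:1",
    "TP53": "GENE:1",
    "insulin receptor substrate 1": "GENE:2",
    "IRS-1": "GENE:2",
}


def normalise(name):
    """Exact-like key: lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def tokens(name):
    """Token set: lowercase runs of letters or digits, so word order is ignored."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", name.lower()))


# Flexible representations of the dictionary, built once in a pre-processing step.
EXACT_LIKE = {normalise(s): gid for s, gid in SYNONYMS.items()}
TOKEN_INDEX = [(tokens(s), gid) for s, gid in SYNONYMS.items()]


def link_mention(mention, threshold=0.75):
    """Return (gene_id, stage), trying the strictest matching stage first."""
    # Stage 1: exact string match against the dictionary.
    if mention in SYNONYMS:
        return SYNONYMS[mention], "exact"

    # Stage 2: exact-like match after removing case, spaces and punctuation.
    key = normalise(mention)
    if key in EXACT_LIKE:
        return EXACT_LIKE[key], "exact-like"

    # Stage 3: token-based approximate match using Jaccard overlap of token sets.
    mention_tokens = tokens(mention)
    best_id, best_score = None, 0.0
    for syn_tokens, gid in TOKEN_INDEX:
        union = mention_tokens | syn_tokens
        score = len(mention_tokens & syn_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = gid, score
    if best_score >= threshold:
        return best_id, "approximate"
    return None, "unlinked"
```

In this sketch, a mention such as "irs 1" falls through the exact stage and is caught by the exact-like stage, while a permuted mention such as "p53 tumor protein" is only resolved by the token-based stage, because token sets ignore the order of name components. Ordering the stages from strict to loose is one plausible way to favour precision first and recover recall later; the record itself does not specify these details.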