
Disclosing ambiguous gene aliases by automatic literature profiling

BACKGROUND: Retrieving pertinent information from biological scientific literature requires cutting-edge text mining methods which may be able to recognize the meaning of the very ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of P...


Bibliographic Details
Main authors: Coimbra, Roney S, Vanderwall, Dana E, Oliveira, Guilherme C
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045796/
https://www.ncbi.nlm.nih.gov/pubmed/21210969
http://dx.doi.org/10.1186/1471-2164-11-S5-S3
_version_ 1782198871508647936
author Coimbra, Roney S
Vanderwall, Dana E
Oliveira, Guilherme C
author_facet Coimbra, Roney S
Vanderwall, Dana E
Oliveira, Guilherme C
author_sort Coimbra, Roney S
collection PubMed
description BACKGROUND: Retrieving pertinent information from the biological scientific literature requires text mining methods that can recognize the meaning of the highly ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies, based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
RESULTS: Aliases were considered “ambiguous” when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of “synonyms”. We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective “synonyms” or “ambiguous” aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and “synonym” aliases allowed a 3.6-fold increase in the number of unique documents retrieved.
CONCLUSIONS: These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gene.
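The classification rule stated in the RESULTS paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how each vocabulary is extracted from the PubMed abstracts, so each vocabulary is assumed here to be a plain set of terms, and the function names and toy term sets are hypothetical.

```python
from typing import Iterable, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def classify_alias(
    alias_vocab: Set[str],
    official_vocab: Set[str],
    control_vocabs: Iterable[Set[str]],
) -> str:
    """Label an alias by comparing it with the official symbol and controls.

    The threshold is the smallest distance between the official symbol and
    any control (unrelated official symbols). An alias at or beyond that
    threshold is "ambiguous"; otherwise it is a "synonym".
    """
    threshold = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= threshold else "synonym"


# Toy vocabularies, invented for illustration only (not data from the paper).
official = {"kinase", "apoptosis", "tumor", "phosphorylation"}
controls = [
    {"tumor", "collagen", "bone"},
    {"kinase", "neuron", "synapse"},
    {"insulin", "glucose", "liver"},
]
alias_close = {"kinase", "apoptosis", "tumor", "signaling"}
alias_far = {"battery", "voltage"}

print(classify_alias(alias_close, official, controls))  # synonym
print(classify_alias(alias_far, official, controls))    # ambiguous
```

Because the threshold comes from the controls chosen for each gene, the rule needs no labeled training examples, which matches the claim in the BACKGROUND paragraph. At least one control vocabulary must be supplied, otherwise `min` raises `ValueError`.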
format Text
id pubmed-3045796
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3045796 2011-03-01 Disclosing ambiguous gene aliases by automatic literature profiling Coimbra, Roney S Vanderwall, Dana E Oliveira, Guilherme C BMC Genomics Proceedings BioMed Central 2010-12-22 /pmc/articles/PMC3045796/ /pubmed/21210969 http://dx.doi.org/10.1186/1471-2164-11-S5-S3 Text en Copyright ©2010 Coimbra et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Coimbra, Roney S
Vanderwall, Dana E
Oliveira, Guilherme C
Disclosing ambiguous gene aliases by automatic literature profiling
title Disclosing ambiguous gene aliases by automatic literature profiling
title_full Disclosing ambiguous gene aliases by automatic literature profiling
title_fullStr Disclosing ambiguous gene aliases by automatic literature profiling
title_full_unstemmed Disclosing ambiguous gene aliases by automatic literature profiling
title_short Disclosing ambiguous gene aliases by automatic literature profiling
title_sort disclosing ambiguous gene aliases by automatic literature profiling
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3045796/
https://www.ncbi.nlm.nih.gov/pubmed/21210969
http://dx.doi.org/10.1186/1471-2164-11-S5-S3
work_keys_str_mv AT coimbraroneys disclosingambiguousgenealiasesbyautomaticliteratureprofiling
AT vanderwalldanae disclosingambiguousgenealiasesbyautomaticliteratureprofiling
AT oliveiraguilhermec disclosingambiguousgenealiasesbyautomaticliteratureprofiling