
The strength of co-authorship in gene name disambiguation

BACKGROUND: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD) – a special case of Word Sense Disambiguation (WSD) – is to assign a unique gene i...


Bibliographic Details
Main Author: Farkas, Richárd
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2262057/
https://www.ncbi.nlm.nih.gov/pubmed/18230174
http://dx.doi.org/10.1186/1471-2105-9-69
collection PubMed
description BACKGROUND: A mention of a biomedical entity in articles and other free text is often ambiguous. For example, 13% of gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine how a special feature of biological articles, namely that the authors of each document are known, can be exploited for the GSD task through graph-based semi-supervised methods.
RESULTS: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias, and that this holds for the biologist's co-authors as well. To make use of co-authorship information, we built the inverse co-author graph on MEDLINE abstracts. The nodes of the inverse co-author graph are articles, and there is an edge between two nodes if and only if the two articles share an author. We introduce two methods that use graph-based distances between abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high precision (99.5%) using only information obtained from the inverse co-author graph. To attain full coverage, we incorporated the co-authorship information into two GSD systems; in experiments, our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.
CONCLUSION: Based on these promising results, we suggest that co-authorship information and the circumstances of an article's release (such as the journal title and the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles. The methods introduced here should therefore also be useful for other biomedical natural language processing tasks, such as organism or target disease detection.
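The graph construction described in the abstract can be illustrated with a minimal sketch. It builds an inverse co-author graph (articles as nodes, an edge whenever two articles share an author) and measures distances with breadth-first search. The abstract does not specify the paper's two distance-based methods, so the nearest-labelled-article rule in `disambiguate` is only an assumed stand-in, and the article IDs, author names and gene labels are hypothetical toy data.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Nodes are article IDs; two articles are linked iff they share an author."""
    by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            by_author[author].add(article)

    graph = {article: set() for article in article_authors}
    for articles in by_author.values():
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_distances(graph, source):
    """Breadth-first search: edge count from source to every reachable article."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def disambiguate(graph, target, labelled):
    """Assumed rule (not the paper's): take the gene ID of the nearest labelled
    article; abstain (return None) if none is reachable or the nearest disagree."""
    dist = graph_distances(graph, target)
    reachable = {a: g for a, g in labelled.items() if a in dist and a != target}
    if not reachable:
        return None
    nearest = min(dist[a] for a in reachable)
    candidates = {g for a, g in reachable.items() if dist[a] == nearest}
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical toy data: article ID -> set of author names.
article_authors = {
    "A1": {"Smith", "Lee"},
    "A2": {"Lee", "Kovacs"},
    "A3": {"Kovacs"},
    "A4": {"Garcia"},
}
# Hypothetical label: the gene that an ambiguous alias denotes in article A1.
labelled = {"A1": "gene_X"}

graph = build_inverse_coauthor_graph(article_authors)
decision_a3 = disambiguate(graph, "A3", labelled)  # reaches A1 via A2
decision_a4 = disambiguate(graph, "A4", labelled)  # isolated node, abstains
```

Abstaining when no labelled article is reachable mirrors the abstract's observation that the graph alone yields a decision in only part of the cases, which is why the co-authorship information was combined with other GSD systems for full coverage.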
format Text
id pubmed-2262057
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2262057 2008-03-04 The strength of co-authorship in gene name disambiguation Farkas, Richárd BMC Bioinformatics Methodology Article BioMed Central 2008-01-29 /pmc/articles/PMC2262057/ /pubmed/18230174 http://dx.doi.org/10.1186/1471-2105-9-69 Text en Copyright © 2008 Farkas; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Methodology Article