Thesaurus-based disambiguation of gene symbols

BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate wi...

Bibliographic Details
Main authors: Schijvenaars, Bob JA, Mons, Barend, Weeber, Marc, Schuemie, Martijn J, van Mulligen, Erik M, Wain, Hester M, Kors, Jan A
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1183190/
https://www.ncbi.nlm.nih.gov/pubmed/15958172
http://dx.doi.org/10.1186/1471-2105-6-149
author Schijvenaars, Bob JA
Mons, Barend
Weeber, Marc
Schuemie, Martijn J
van Mulligen, Erik M
Wain, Hester M
Kors, Jan A
author_sort Schijvenaars, Bob JA
collection PubMed
description BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. CONCLUSION: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
format Text
id pubmed-1183190
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1183190 2005-08-06 Thesaurus-based disambiguation of gene symbols Schijvenaars, Bob JA Mons, Barend Weeber, Marc Schuemie, Martijn J van Mulligen, Erik M Wain, Hester M Kors, Jan A BMC Bioinformatics Research Article BioMed Central 2005-06-16 /pmc/articles/PMC1183190/ /pubmed/15958172 http://dx.doi.org/10.1186/1471-2105-6-149 Text en Copyright © 2005 Schijvenaars et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Thesaurus-based disambiguation of gene symbols
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1183190/
https://www.ncbi.nlm.nih.gov/pubmed/15958172
http://dx.doi.org/10.1186/1471-2105-6-149