Cargando…

Gene and protein nomenclature in public databases

BACKGROUND: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require...

Descripción completa

Detalles Bibliográficos
Autores principales: Fundel, Katrin, Zimmer, Ralf
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2006
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1560172/
https://www.ncbi.nlm.nih.gov/pubmed/16899134
http://dx.doi.org/10.1186/1471-2105-7-372
_version_ 1782129480217657344
author Fundel, Katrin
Zimmer, Ralf
author_facet Fundel, Katrin
Zimmer, Ralf
author_sort Fundel, Katrin
collection PubMed
description BACKGROUND: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require the knowledge of all used names referring to a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. RESULTS: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, to a lexicon of common English words and domain-related non-gene terms, and we compared different data sources in terms of size of extracted dictionaries and overlap of synonyms between those. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between different organisms. Furthermore, it shows that, despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the considered organism. CONCLUSION: In conclusion, these results indicate that the combination of data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more used names than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application.
format Text
id pubmed-1560172
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-15601722006-09-06 Gene and protein nomenclature in public databases Fundel, Katrin Zimmer, Ralf BMC Bioinformatics Research Article BACKGROUND: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require the knowledge of all used names referring to a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. RESULTS: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, to a lexicon of common English words and domain-related non-gene terms, and we compared different data sources in terms of size of extracted dictionaries and overlap of synonyms between those. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between different organisms. Furthermore, it shows that, despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the considered organism. CONCLUSION: In conclusion, these results indicate that the combination of data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more used names than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application. BioMed Central 2006-08-09 /pmc/articles/PMC1560172/ /pubmed/16899134 http://dx.doi.org/10.1186/1471-2105-7-372 Text en Copyright © 2006 Fundel and Zimmer; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Fundel, Katrin
Zimmer, Ralf
Gene and protein nomenclature in public databases
title Gene and protein nomenclature in public databases
title_full Gene and protein nomenclature in public databases
title_fullStr Gene and protein nomenclature in public databases
title_full_unstemmed Gene and protein nomenclature in public databases
title_short Gene and protein nomenclature in public databases
title_sort gene and protein nomenclature in public databases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1560172/
https://www.ncbi.nlm.nih.gov/pubmed/16899134
http://dx.doi.org/10.1186/1471-2105-7-372
work_keys_str_mv AT fundelkatrin geneandproteinnomenclatureinpublicdatabases
AT zimmerralf geneandproteinnomenclatureinpublicdatabases