GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains

The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are p...

Bibliographic Details
Main authors: Wei, Chih-Hsuan; Kao, Hung-Yu; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Hindawi Publishing Corporation 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4561873/
https://www.ncbi.nlm.nih.gov/pubmed/26380306
http://dx.doi.org/10.1155/2015/918710
author Wei, Chih-Hsuan
Kao, Hung-Yu
Lu, Zhiyong
collection PubMed
description The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available, and these tools are typically restricted to detecting only mention-level gene names or only document-level gene identifiers. In this work, we report GNormPlus: an end-to-end, open-source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations not only for gene names and their identifiers but also for closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
format Online
Article
Text
id pubmed-4561873
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-4561873 2015-09-15 GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains Wei, Chih-Hsuan Kao, Hung-Yu Lu, Zhiyong Biomed Res Int Research Article Hindawi Publishing Corporation 2015 2015-08-25 /pmc/articles/PMC4561873/ /pubmed/26380306 http://dx.doi.org/10.1155/2015/918710 Text en Copyright © 2015 Chih-Hsuan Wei et al.
https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains
topic Research Article