
NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions

In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more com...


Bibliographic Details
Main authors: Dai, Hong-Jie; Singh, Onkar; Jonnagaddala, Jitendra; Su, Emily Chia-Yu
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4962763/
https://www.ncbi.nlm.nih.gov/pubmed/27465130
http://dx.doi.org/10.1093/database/baw111
_version_ 1782444875985190912
author Dai, Hong-Jie
Singh, Onkar
Jonnagaddala, Jitendra
Su, Emily Chia-Yu
collection PubMed
description In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus
format Online
Article
Text
id pubmed-4962763
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4962763 2016-07-28 NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions Dai, Hong-Jie Singh, Onkar Jonnagaddala, Jitendra Su, Emily Chia-Yu Database (Oxford) Original Article Oxford University Press 2016-07-27 /pmc/articles/PMC4962763/ /pubmed/27465130 http://dx.doi.org/10.1093/database/baw111 Text en © The Author(s) 2016. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4962763/
https://www.ncbi.nlm.nih.gov/pubmed/27465130
http://dx.doi.org/10.1093/database/baw111