Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

Bibliographic Details
Main authors: Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331699/
https://www.ncbi.nlm.nih.gov/pubmed/25810780
http://dx.doi.org/10.1186/1758-2946-7-S1-S9
author Munkhdalai, Tsendsuren
Li, Meijing
Batsuren, Khuyagbaatar
Park, Hyeon Ah
Choi, Nak Hyeon
Ryu, Keun Ho
author_sort Munkhdalai, Tsendsuren
collection PubMed
description BACKGROUND: Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite for effective text mining of biochemical text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining, owing to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data to incorporate domain knowledge into a named entity recognition model and thereby improve system performance. The proposed method comprises Natural Language Processing (NLP) tasks for text preprocessing, word representation features learned from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than free text in the domain, the proposed method does not rely on any lexicon or dictionary, which keeps the system applicable to other NER tasks in bio-text data. RESULTS: We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER. It scales to millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems through the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved F-measures of 85.68% and 86.47% on the testing sets of the CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. The BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.
format Online
Article
Text
id pubmed-4331699
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4331699 2015-03-25 Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations Munkhdalai, Tsendsuren Li, Meijing Batsuren, Khuyagbaatar Park, Hyeon Ah Choi, Nak Hyeon Ryu, Keun Ho J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331699/ /pubmed/25810780 http://dx.doi.org/10.1186/1758-2946-7-S1-S9 Text en Copyright © 2015 Munkhdalai et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
topic Research