
Incorporating rich background knowledge for gene named entity classification and recognition

BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n...


Bibliographic Details
Main authors:	Li, Yanpeng; Lin, Hongfei; Yang, Zhihao
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2725142/ https://www.ncbi.nlm.nih.gov/pubmed/19615051 http://dx.doi.org/10.1186/1471-2105-10-223

author	Li, Yanpeng; Lin, Hongfei; Yang, Zhihao
collection	PubMed
description	BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. RESULTS: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher-level features using term frequency and co-occurrence information of highly indicative features in a huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries in a large dictionary derived from online resources. The results show that new features generated by FCG outperform lexical features by 5.97 in F-score, and by 10.85 for OOV terms. Also, in this framework, each extension yields significant improvements, and the sparse lexical features can be transformed into a representation that is both lower dimensional and more informative. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. Then we combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at .
format	Text
id	pubmed-2725142
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2725142 2009-08-12 Incorporating rich background knowledge for gene named entity classification and recognition Li, Yanpeng; Lin, Hongfei; Yang, Zhihao BMC Bioinformatics Research Article BioMed Central 2009-07-17 /pmc/articles/PMC2725142/ /pubmed/19615051 http://dx.doi.org/10.1186/1471-2105-10-223 Text en Copyright © 2009 Li et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Incorporating rich background knowledge for gene named entity classification and recognition
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2725142/ https://www.ncbi.nlm.nih.gov/pubmed/19615051 http://dx.doi.org/10.1186/1471-2105-10-223
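
The description above mentions a forward maximum match method run over a refined gene dictionary. As a rough illustration of that general technique only, and not the authors' implementation, the following Python sketch scans tokens left to right and, at each position, takes the longest dictionary entry that matches. The function name `forward_maximum_match`, the `max_len` parameter, the lowercased matching, and the toy dictionary are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def forward_maximum_match(
    tokens: List[str],
    dictionary: Set[Tuple[str, ...]],
    max_len: int = 5,  # hypothetical cap on entry length, in tokens
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans (end exclusive) of greedy longest matches."""
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate starting at i first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            # Assumption: matching is case-insensitive.
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in dictionary:
                spans.append((i, i + n))
                i += n  # continue after the matched entry
                matched = True
                break
        if not matched:
            i += 1  # no entry starts here; move one token forward
    return spans


# Toy dictionary for illustration only; not taken from the paper.
toy_dictionary = {
    ("brca1",),
    ("tumor", "necrosis", "factor"),
    ("tumor", "necrosis", "factor", "alpha"),
}
tokens = "Expression of tumor necrosis factor alpha and BRCA1 was measured".split()
spans = forward_maximum_match(tokens, toy_dictionary)
print([" ".join(tokens[s:e]) for s, e in spans])
```

In this sketch the longer entry "tumor necrosis factor alpha" is preferred over its prefix "tumor necrosis factor", which is the defining behavior of maximum matching. How the paper tokenizes text or resolves ambiguous entries is not stated in this record.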


Similar Items