
NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

BACKGROUND: Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In rece...


Bibliographic Details
Main Authors:	Tsai, Richard Tzong-Han, Sung, Cheng-Lung, Dai, Hong-Jie, Hung, Hsieh-Chuan, Sung, Ting-Yi, Hsu, Wen-Lian
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1764467/ https://www.ncbi.nlm.nih.gov/pubmed/17254295 http://dx.doi.org/10.1186/1471-2105-7-S5-S11

_version_	1782131616827572224
author	Tsai, Richard Tzong-Han Sung, Cheng-Lung Dai, Hong-Jie Hung, Hsieh-Chuan Sung, Ting-Yi Hsu, Wen-Lian
author_facet	Tsai, Richard Tzong-Han Sung, Cheng-Lung Dai, Hong-Jie Hung, Hsieh-Chuan Sung, Ting-Yi Hsu, Wen-Lian
author_sort	Tsai, Richard Tzong-Han
collection	PubMed
description	BACKGROUND: Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches usually employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing. 
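The numerical normalization step described in the BACKGROUND can be illustrated with a minimal sketch. The abstract only states that all numerals in a term are replaced with one representative numeral; the choice of "1" as that numeral, the decision to collapse each run of digits into a single numeral, and the function name `normalize_numerals` are assumptions made here for illustration, not details taken from the paper.

```python
import re


def normalize_numerals(term: str, representative: str = "1") -> str:
    """Replace every run of digits in `term` with one representative numeral.

    Assumption: a run such as "12" collapses to a single numeral, so that
    IL2 and IL12 map to the same normalized form.
    """
    return re.sub(r"\d+", representative, term)


if __name__ == "__main__":
    # Terms that differ only in their numerical part share one form.
    for term in ["IL2", "IL12", "CD28", "kinase"]:
        print(term, "->", normalize_numerals(term))
```

Under these assumptions, terms that differ only in their numerical parts yield the same feature value, which is the effect the abstract credits with reducing sparseness and redundant features.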
RESULTS: To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which is better than the baseline system that only uses singleton features. CONCLUSION: We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
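As a rough illustration of the Smith-Waterman local alignment mentioned above, the sketch below aligns two tokenized sentences and returns the best-scoring local alignment. The abstract does not give the scoring scheme or explain how global patterns are derived from alignments, so the scores (match 2, mismatch -1, gap -1), the word-level tokenization, and the function name `smith_waterman` are assumptions; this is the generic textbook algorithm, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; cell values never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    pairs: List[Tuple[str, str]] = []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    s1 = "the expression of IL1 gene was induced".split()
    s2 = "expression of CD1 gene in cells".split()
    print(smith_waterman(s1, s2))
```

The aligned region can extend past a fixed five-word context window, which is the property the abstract relies on for pattern-based post-processing.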
format	Text
id	pubmed-1764467
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1764467 2007-01-09 NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition Tsai, Richard Tzong-Han Sung, Cheng-Lung Dai, Hong-Jie Hung, Hsieh-Chuan Sung, Ting-Yi Hsu, Wen-Lian BMC Bioinformatics Proceedings BioMed Central 2006-12-18 /pmc/articles/PMC1764467/ /pubmed/17254295 http://dx.doi.org/10.1186/1471-2105-7-S5-S11 Text en Copyright © 2006 Tsai et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1764467/ https://www.ncbi.nlm.nih.gov/pubmed/17254295 http://dx.doi.org/10.1186/1471-2105-7-S5-S11
