Systematic feature evaluation for gene name recognition

Bibliographic Details
Main authors:	Hakenberg, Jörg; Bickel, Steffen; Plake, Conrad; Brefeld, Ulf; Zahn, Hagen; Faulstich, Lukas; Leser, Ulf; Scheffer, Tobias
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Report
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869023/ https://www.ncbi.nlm.nih.gov/pubmed/15960843 http://dx.doi.org/10.1186/1471-2105-6-S1-S9

Collection:	PubMed
Description:	In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine (SVM), combined with pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as prefixes or suffixes, character n-grams, patterns of capitalization, or the classification of preceding or following words. We present a systematic approach to evaluating the performance of different feature sets based on recursive feature elimination (RFE). By systematically reducing the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves performance by 0.7% compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
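The recursive feature elimination idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the trainer below is a plain subgradient-descent linear SVM standing in for a real SVM solver, the squared-weight ranking criterion is the one commonly used for SVM-based RFE and is an assumption about the paper's details, and the function names, the toy data, and the feature names (`capitalized`, `never_fires`, `weak_copy`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    active: Sequence[int],
    epochs: int = 200,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[Dict[int, float], float]:
    """Fit weights for the active columns by subgradient descent on the
    L2-regularized hinge loss (a stand-in for a real SVM solver)."""
    w = {j: 0.0 for j in active}
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in active) + b)
            violated = margin < 1
            for j in active:
                grad = lam * w[j] - (yi * xi[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def rfe_ranking(
    X: Sequence[Sequence[float]], y: Sequence[int], names: Sequence[str]
) -> List[str]:
    """Return feature names in elimination order (least useful first).

    Each round retrains on the surviving features and drops the one with
    the smallest squared weight (assumed ranking criterion)."""
    active = list(range(len(names)))
    eliminated: List[str] = []
    while len(active) > 1:
        w, _ = train_linear_svm(X, y, active)
        worst = min(active, key=lambda j: w[j] ** 2)
        active.remove(worst)
        eliminated.append(names[worst])
    eliminated.append(names[active[0]])
    return eliminated


# Toy data: column 0 separates the classes, column 1 never fires,
# column 2 is a redundant, down-scaled copy of column 0.
names = ["capitalized", "never_fires", "weak_copy"]
X = [[1, 0, 0.1], [1, 0, 0.1], [0, 0, 0.0], [0, 0, 0.0]]
y = [1, 1, -1, -1]

print(rfe_ranking(X, y, names))
# → ['never_fires', 'weak_copy', 'capitalized']
```

In a real system the columns would be the window features named in the description (prefixes, suffixes, character n-grams, capitalization patterns), and each round would typically remove a batch of features and re-evaluate performance on held-out data rather than dropping one feature at a time.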
Journal:	BMC Bioinformatics
Publication date:	2005-05-24
Record ID:	pubmed-1869023 (2007-05-18)
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
License:	Copyright © 2005 Hakenberg et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
