
Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing

Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our no...


Bibliographic Details
Main Authors: Zhu, Fei; Shen, Bairong
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383748/
https://www.ncbi.nlm.nih.gov/pubmed/22745720
http://dx.doi.org/10.1371/journal.pone.0039230
collection PubMed
description Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we first use SVM to separate biological terms from non-biological terms and then use CRFs to determine the types of the biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop an algorithm and apply two rules. To ensure that biological terms of maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach also gradually extends the context so that more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and the JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
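The description reports macro-precision, macro-recall, and macro-F1 but does not spell out how they are computed. The following is a minimal sketch of one common definition, assuming token-level labels and taking macro-F1 as the unweighted mean of the per-class F1 scores; the record does not confirm that the authors used this exact definition. The function name `macro_scores`, the `ignore` parameter, and the toy labels are illustrative assumptions, not taken from the article.

```python
from collections import Counter


def macro_scores(gold, pred, ignore=("O",)):
    """Return (macro-precision, macro-recall, macro-F1) over entity classes.

    Assumption: each class counts equally (unweighted mean of per-class
    scores), and labels listed in `ignore` (here the non-entity label "O")
    are not scored as classes of their own.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    labels = sorted((set(gold) | set(pred)) - set(ignore))
    if not labels:
        return 0.0, 0.0, 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1  # correct entity label
        else:
            if p not in ignore:
                fp[p] += 1  # predicted class p where it does not belong
            if g not in ignore:
                fn[g] += 1  # missed an instance of class g

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with made-up labels (not data from the article).
gold = ["protein", "protein", "DNA", "O", "DNA", "O"]
pred = ["protein", "DNA", "DNA", "O", "O", "protein"]
macro_p, macro_r, macro_f1 = macro_scores(gold, pred)
```

Because every class contributes equally to the average, a rare entity type that is recognized poorly lowers the macro scores as much as a frequent one would, which is one reason macro-averaging is often reported alongside methods that try to avoid bias against rare events.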
id pubmed-3383748
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3383748 2012-06-28 Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing Zhu, Fei Shen, Bairong PLoS One Research Article Public Library of Science 2012-06-26 /pmc/articles/PMC3383748/ /pubmed/22745720 http://dx.doi.org/10.1371/journal.pone.0039230 Text en Zhu, Shen.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
topic Research Article