
A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

BACKGROUND: Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due t...


Bibliographic Details
Main Authors:	Tang, Buzhou; Feng, Yudong; Wang, Xiaolong; Wu, Yonghui; Zhang, Yaoyun; Jiang, Min; Wang, Jingqi; Xu, Hua
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331698/ https://www.ncbi.nlm.nih.gov/pubmed/25810779 http://dx.doi.org/10.1186/1758-2946-7-S1-S8

author	Tang, Buzhou; Feng, Yudong; Wang, Xiaolong; Wu, Yonghui; Zhang, Yaoyun; Jiang, Min; Wang, Jingqi; Xu, Hua
collection	PubMed
description	BACKGROUND: Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity recognition systems, the Spanish National Cancer Research Center (CNIO) and the University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge contains two individual subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task. METHODS: The 2013 CHEMDNER challenge organizers provided 10,000 UTF-8-encoded PubMed abstracts, manually annotated according to a predefined annotation guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts and a test set of 3,000 abstracts. We developed machine learning-based systems, based on conditional random fields (CRF) and structured support vector machines (SSVM) respectively, for the CEM task on this data set. The effects of three types of word representation (WR) features, generated by Brown clustering, random indexing and skip-gram, on both machine learning-based systems were also investigated. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. Primary evaluation measures were micro Precision, Recall, and F-measure. RESULTS: Our best system was among the top ranked systems with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the performance of the system (micro F-measure of 85.20%). CONCLUSIONS: The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and SSVM-based systems using all three types of WR features showed better performance than the systems using only one type of WR feature.
format	Online Article Text
id	pubmed-4331698
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4331698 2015-03-25 J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331698/ /pubmed/25810779 http://dx.doi.org/10.1186/1758-2946-7-S1-S8 Text en Copyright © 2015 Tang et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331698/ https://www.ncbi.nlm.nih.gov/pubmed/25810779 http://dx.doi.org/10.1186/1758-2946-7-S1-S8
