A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature

BACKGROUND: In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to be able to identify chemical entity mentions (CEMs) automatically within text. The CHEMDNER challenge in BioCreative IV was specially designed...

Bibliographic Details
Main authors:	Xu, Shuo, An, Xin, Zhu, Lijun, Zhang, Yunliang, Zhang, Haodong
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331687/ https://www.ncbi.nlm.nih.gov/pubmed/25810768 http://dx.doi.org/10.1186/1758-2946-7-S1-S11

_version_	1782357758792695808
author	Xu, Shuo An, Xin Zhu, Lijun Zhang, Yunliang Zhang, Haodong
author_facet	Xu, Shuo An, Xin Zhu, Lijun Zhang, Yunliang Zhang, Haodong
author_sort	Xu, Shuo
collection	PubMed
description	BACKGROUND: In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to be able to identify chemical entity mentions (CEMs) automatically within text. The CHEMDNER challenge in BioCreative IV was specially designed to promote the implementation of corresponding systems that are able to detect mentions of chemical compounds and drugs, and it has two subtasks: CDI (Chemical Document Indexing) and CEM. RESULTS: Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter in the CRF model was optimized by 10-fold cross validation with grid search, and a word representation feature induced by the Brown clustering method was introduced. For the CEM subtask, our official runs ranked in the top position by obtaining maximum 88.79% precision, 69.08% recall and 77.70% balanced F-measure, which were improved further to 88.43% precision, 76.48% recall and 82.02% balanced F-measure in our post-challenge system. CONCLUSIONS: In our system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. Though our current system has much room for improvement, it is valuable in showing that the performance in terms of balanced F-measure can be improved substantially by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and optimizing the cost parameter in the CRF model. From our practice and lessons, if one directly utilizes open-source natural language processing (NLP) toolkits, such as OpenNLP or Stanford CoreNLP, the false positive (FP) rate may be very high. It is better to develop some additional rules to minimize the FP rate if one does not want to re-train the related models. Our CEM recognition system is available at: http://www.SciTeMiner.org/XuShuo/Demo/CEM.
format	Online Article Text
id	pubmed-4331687
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4331687 2015-03-25 A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature Xu, Shuo An, Xin Zhu, Lijun Zhang, Yunliang Zhang, Haodong J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331687/ /pubmed/25810768 http://dx.doi.org/10.1186/1758-2946-7-S1-S11 Text en Copyright © 2015 Xu et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Xu, Shuo An, Xin Zhu, Lijun Zhang, Yunliang Zhang, Haodong A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature
title	A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature
title_sort	crf-based system for recognizing chemical entity mentions (cems) in biomedical literature
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331687/ https://www.ncbi.nlm.nih.gov/pubmed/25810768 http://dx.doi.org/10.1186/1758-2946-7-S1-S11
work_keys_str_mv	AT xushuo acrfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT anxin acrfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhulijun acrfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhangyunliang acrfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhanghaodong acrfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT xushuo crfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT anxin crfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhulijun crfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhangyunliang crfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature AT zhanghaodong crfbasedsystemforrecognizingchemicalentitymentionscemsinbiomedicalliterature
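
The precision, recall and balanced F-measure figures quoted in the description field can be cross-checked against each other. The sketch below assumes that "balanced F-measure" means the usual F1 score, the harmonic mean of precision and recall; the record itself does not define the term. The function name `balanced_f_measure` is illustrative and is not taken from the paper.

```python
def balanced_f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1).

    Inputs and output are percentages. This is the standard F1
    definition, assumed here; the record does not spell it out.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent.
official = balanced_f_measure(88.79, 69.08)
post_challenge = balanced_f_measure(88.43, 76.48)

print(f"official runs:  {official:.2f}")        # 77.70
print(f"post-challenge: {post_challenge:.2f}")  # 82.02
```

Under that assumption both reported F-measures are reproduced to two decimal places. The comparison also shows that the post-challenge gain comes from recall (69.08% to 76.48%), while precision is slightly lower (88.79% to 88.43%).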

Similar items