
CHEMDNER system with mixed conditional random fields and multi-scale word clustering

BACKGROUND: Chemical compound and drug name recognition plays an important role in chemical text mining and is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound...


Bibliographic Details
Main Authors: Lu, Yanan, Ji, Donghong, Yao, Xiaoyuan, Wei, Xiaomei, Liang, Xiaohui
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331694/
https://www.ncbi.nlm.nih.gov/pubmed/25810775
http://dx.doi.org/10.1186/1758-2946-7-S1-S4
author Lu, Yanan
Ji, Donghong
Yao, Xiaoyuan
Wei, Xiaomei
Liang, Xiaohui
collection PubMed
description BACKGROUND: Chemical compound and drug name recognition plays an important role in chemical text mining and is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound and drug names is therefore necessary. METHODS: We developed a CHEMDNER system based on mixed conditional random fields (CRF) with word clustering for chemical compound and drug name recognition. For the word clustering, we used Brown's hierarchical algorithm and the deep-learning-based Skip-gram model, applied to a large collection of PubMed articles (titles and abstracts). RESULTS: This system achieved the highest F-score of 88.20% for the CDI task and the second highest F-score of 87.11% for the CEM task in BioCreative IV. The performance was further improved by multi-scale clustering based on deep learning, achieving F-scores of 88.71% for CDI and 88.06% for CEM. CONCLUSIONS: The mixed CRF model represents both the internal complexity and the external contexts of the entities, and it is integrated with word clustering to capture domain knowledge from PubMed titles and abstracts. This domain knowledge helps to ensure the performance of entity recognition, even without fine-grained linguistic features or manually designed rules.
format Online
Article
Text
id pubmed-4331694
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4331694 2015-03-25 CHEMDNER system with mixed conditional random fields and multi-scale word clustering Lu, Yanan Ji, Donghong Yao, Xiaoyuan Wei, Xiaomei Liang, Xiaohui J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331694/ /pubmed/25810775 http://dx.doi.org/10.1186/1758-2946-7-S1-S4 Text en Copyright © 2015 Lu et al.; licensee Springer.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title CHEMDNER system with mixed conditional random fields and multi-scale word clustering
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331694/
https://www.ncbi.nlm.nih.gov/pubmed/25810775
http://dx.doi.org/10.1186/1758-2946-7-S1-S4
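
Note on the method: the abstract describes feeding word-cluster information into a CRF at more than one scale but does not spell out the feature design. The sketch below illustrates one widely used way to do this for hierarchical (Brown-style) clusters: each word's cluster path is a bit string, and prefixes of several lengths give coarse-to-fine cluster features for a sequence labeler. Everything here is an assumption for illustration, not the authors' implementation: the cluster paths in `BROWN_PATHS` are made up, the prefix lengths in `PREFIX_LENGTHS` are arbitrary, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical Brown cluster paths (bit strings). Real paths would come from
# running a Brown clustering tool over a large corpus such as PubMed abstracts.
BROWN_PATHS: Dict[str, str] = {
    "aspirin": "110100",
    "ibuprofen": "110101",
    "inhibits": "0111",
}

# Assumed prefix lengths (the "scales"); not taken from the paper.
PREFIX_LENGTHS: Sequence[int] = (2, 4, 6)


def cluster_features(token: str,
                     paths: Dict[str, str] = BROWN_PATHS,
                     lengths: Sequence[int] = PREFIX_LENGTHS) -> Dict[str, str]:
    """Return one cluster feature per prefix length for a single token."""
    path = paths.get(token.lower())
    if path is None:
        # Words without a cluster get a single out-of-vocabulary marker.
        return {"brown_oov": "1"}
    # Short prefixes are coarse clusters, long prefixes are fine-grained ones.
    return {f"brown_p{n}": path[:n] for n in lengths}


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Build per-token feature dicts in the style many CRF toolkits accept."""
    features = []
    for i, token in enumerate(tokens):
        feats = {"word": token.lower()}
        feats.update(cluster_features(token))
        if i > 0:
            # Add the previous token's cluster features as left context.
            prev = cluster_features(tokens[i - 1])
            feats.update({f"prev_{k}": v for k, v in prev.items()})
        features.append(feats)
    return features


if __name__ == "__main__":
    for row in sentence_features(["Aspirin", "inhibits", "COX-1"]):
        print(row)
```

In this toy data, "aspirin" and "ibuprofen" share their 2-bit and 4-bit prefixes and differ only at 6 bits, so a model can generalize between them at the coarse scales while still telling them apart at the fine one. For vector-based clusters such as those derived from Skip-gram embeddings, an analogous approach would be to cluster the vectors several times with different numbers of clusters and emit one cluster ID per granularity.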