
Improving chemical disease relation extraction with rich features and weakly labeled data

BACKGROUND: Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedic...


Bibliographic Details
Main Authors:	Peng, Yifan; Wei, Chih-Hsuan; Lu, Zhiyong
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5054544/ https://www.ncbi.nlm.nih.gov/pubmed/28316651 http://dx.doi.org/10.1186/s13321-016-0165-z

author	Peng, Yifan; Wei, Chih-Hsuan; Lu, Zhiyong
collection	PubMed
description	BACKGROUND: Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations. RESULTS: We propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51 %, which already compares favorably to previous methods. Our system performance was further improved to 61.01 % in F-score when augmented with additional automatically generated weakly labeled data. CONCLUSIONS: Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
format	Online Article Text
id	pubmed-5054544
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-5054544 2017-03-17 Improving chemical disease relation extraction with rich features and weakly labeled data Peng, Yifan Wei, Chih-Hsuan Lu, Zhiyong J Cheminform Research Article Springer International Publishing 2016-10-07 /pmc/articles/PMC5054544/ /pubmed/28316651 http://dx.doi.org/10.1186/s13321-016-0165-z Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Improving chemical disease relation extraction with rich features and weakly labeled data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5054544/ https://www.ncbi.nlm.nih.gov/pubmed/28316651 http://dx.doi.org/10.1186/s13321-016-0165-z
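
The abstract mentions augmenting the training set with automatically generated labeled text derived from curated document-level annotations in an existing knowledge base, but it does not say how that labeling is done. The sketch below shows one common way such weak labels can be produced, and it is an assumption rather than the authors' procedure: every chemical-disease pair mentioned in a document is enumerated, and a pair is marked positive only if the knowledge base lists it for that document. The function name `weak_labels`, the document IDs, and the entity identifiers (`CHEM_A`, `DIS_X`, and so on) are hypothetical placeholders, not real curated data.

```python
from itertools import product


def weak_labels(documents, knowledge_base):
    """Build weakly labeled (doc_id, chemical, disease, label) examples.

    documents:      doc_id -> {"chemicals": set, "diseases": set}
                    (entity mentions found in the text; hypothetical input shape)
    knowledge_base: doc_id -> set of (chemical, disease) pairs curated
                    at document level (hypothetical input shape)
    """
    examples = []
    for doc_id, doc in documents.items():
        curated = knowledge_base.get(doc_id, set())
        # Enumerate every candidate pair that co-occurs in the document.
        # Sorting keeps the output order deterministic.
        for chem, dis in product(sorted(doc["chemicals"]), sorted(doc["diseases"])):
            # Positive only if the knowledge base lists this pair for this document.
            examples.append((doc_id, chem, dis, (chem, dis) in curated))
    return examples


# Hypothetical toy data, for illustration only.
documents = {
    "doc1": {"chemicals": {"CHEM_A", "CHEM_B"}, "diseases": {"DIS_X"}},
    "doc2": {"chemicals": {"CHEM_A"}, "diseases": {"DIS_X", "DIS_Y"}},
}
knowledge_base = {
    "doc1": {("CHEM_A", "DIS_X")},
    "doc2": {("CHEM_A", "DIS_Y")},
}

examples = weak_labels(documents, knowledge_base)
for example in examples:
    print(example)
```

Examples produced this way could then be added to the human-annotated training set before a classifier is fitted. The labels are noisy because a document-level annotation does not guarantee that any particular sentence states the relation.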
