Self-training in significance space of support vectors for imbalanced biomedical event data

BACKGROUND: Pairwise relationships extracted from biomedical literature are insufficient in formulating biomolecular interactions. Extraction of complex relations (namely, biomedical events) has become the main focus of the text-mining community. However, there are two critical issues that are seldo...

Bibliographic Details
Main Authors:	Munkhdalai, Tsendsuren, Namsrai, Oyun-Erdene, Ryu, Keun Ho
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4423724/ https://www.ncbi.nlm.nih.gov/pubmed/25952719 http://dx.doi.org/10.1186/1471-2105-16-S7-S6

author	Munkhdalai, Tsendsuren Namsrai, Oyun-Erdene Ryu, Keun Ho
collection	PubMed
description	BACKGROUND: Pairwise relationships extracted from biomedical literature are insufficient in formulating biomolecular interactions. Extraction of complex relations (namely, biomedical events) has become the main focus of the text-mining community. However, there are two critical issues that are seldom dealt with by existing systems. First, an annotated corpus for training a prediction model is highly imbalanced. Second, supervised models trained on only a single annotated corpus can limit system performance. Fortunately, there is a large pool of unlabeled data containing much of the domain background that one can exploit. RESULTS: In this study, we develop a new semi-supervised learning method to address the issues outlined above. The proposed algorithm efficiently exploits the unlabeled data to leverage system performance. We furthermore extend our algorithm to a two-phase learning framework. The first phase balances the training data for initial model induction. The second phase incorporates domain knowledge into the event extraction model. The effectiveness of our method is evaluated on the Genia event extraction corpus and a PubMed document pool. Our method can identify a small subset of the majority class, which is sufficient for building a well-generalized prediction model. It outperforms the traditional self-training algorithm in terms of f-measure. Our model, based on the training data and the unlabeled data pool, achieves comparable performance to the state-of-the-art systems that are trained on a larger annotated set consisting of training and evaluation data.
format	Online Article Text
id	pubmed-4423724
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4423724 2015-05-13 Self-training in significance space of support vectors for imbalanced biomedical event data Munkhdalai, Tsendsuren Namsrai, Oyun-Erdene Ryu, Keun Ho BMC Bioinformatics Research BioMed Central 2015-04-23 /pmc/articles/PMC4423724/ /pubmed/25952719 http://dx.doi.org/10.1186/1471-2105-16-S7-S6 Text en Copyright © 2015 Munkhdalai et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Self-training in significance space of support vectors for imbalanced biomedical event data
topic	Research
