
An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requir...


Bibliographic Details
Main Authors:	Stanescu, Ana, Caragea, Doina
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4565116/ https://www.ncbi.nlm.nih.gov/pubmed/26356316 http://dx.doi.org/10.1186/1752-0509-9-S5-S1

author	Stanescu, Ana Caragea, Doina
collection	PubMed
description	BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. RESULTS: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. CONCLUSIONS: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
format	Online Article Text
id	pubmed-4565116
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4565116 2015-09-18 An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets Stanescu, Ana Caragea, Doina BMC Syst Biol Research BioMed Central 2015-09-01 /pmc/articles/PMC4565116/ /pubmed/26356316 http://dx.doi.org/10.1186/1752-0509-9-S5-S1 Text en Copyright © 2015 Stanescu and Caragea. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4565116/ https://www.ncbi.nlm.nih.gov/pubmed/26356316 http://dx.doi.org/10.1186/1752-0509-9-S5-S1

