An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requir...

Bibliographic Details
Main Authors: Stanescu, Ana; Caragea, Doina
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4565116/
https://www.ncbi.nlm.nih.gov/pubmed/26356316
http://dx.doi.org/10.1186/1752-0509-9-S5-S1
author Stanescu, Ana
Caragea, Doina
collection PubMed
description BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. Labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches, which use unlabeled data in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem because the data distribution is highly imbalanced, i.e., the number of splice sites is small compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. RESULTS: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when labeled data make up less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. CONCLUSIONS: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
format Online
Article
Text
id pubmed-4565116
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4565116 2015-09-18 An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets Stanescu, Ana Caragea, Doina BMC Syst Biol Research BioMed Central 2015-09-01 /pmc/articles/PMC4565116/ /pubmed/26356316 http://dx.doi.org/10.1186/1752-0509-9-S5-S1 Text en Copyright © 2015 Stanescu and Caragea. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets
topic Research
work_keys_str_mv AT stanescuana anempiricalstudyofensemblebasedsemisupervisedlearningapproachesforimbalancedsplicesitedatasets
AT carageadoina anempiricalstudyofensemblebasedsemisupervisedlearningapproachesforimbalancedsplicesitedatasets
AT stanescuana empiricalstudyofensemblebasedsemisupervisedlearningapproachesforimbalancedsplicesitedatasets
AT carageadoina empiricalstudyofensemblebasedsemisupervisedlearningapproachesforimbalancedsplicesitedatasets
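
The description field above mentions ensembles of self-training classifiers that dynamically balance the labeled set during the semi-supervised iterations. The sketch below illustrates that general idea only; it is not the authors' implementation. Everything in it is an assumption made for illustration: the toy nearest-centroid base learner on a single numeric feature, the function names (fit, score, self_train, train_ensemble, predict), the majority vote, the choice to start each ensemble member from all labeled positives plus an equally sized random subset of negatives, and the made-up data.

```python
import random
from statistics import mean


def fit(labeled):
    """Toy base learner (assumption): one centroid per class on a 1-D feature."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return mean(pos), mean(neg)


def score(model, x):
    """Confidence that x is positive; > 0 means closer to the positive centroid."""
    pos_c, neg_c = model
    return abs(x - neg_c) - abs(x - pos_c)


def self_train(labeled, unlabeled, rounds=3, per_class=2):
    """Self-training that adds equally many pseudo-positives and pseudo-negatives
    per round, so the labeled set stays balanced as it grows."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = fit(labeled)
        ranked = sorted(pool, key=lambda x: score(model, x))
        # Most confident predicted positives sit at the end of the ranking.
        pos = [x for x in reversed(ranked) if score(model, x) > 0][:per_class]
        # Most confident predicted negatives sit at the start of the ranking.
        neg = [x for x in ranked if score(model, x) < 0][:len(pos)]
        k = min(len(pos), len(neg))
        if k == 0:
            break  # nothing left that can be added in a balanced way
        pos, neg = pos[:k], neg[:k]
        labeled += [(x, 1) for x in pos] + [(x, 0) for x in neg]
        for x in pos + neg:
            pool.remove(x)
    return fit(labeled)


def train_ensemble(positives, negatives, unlabeled, members=5, seed=0):
    """Each member starts from all positives plus a random, equally sized
    negative subset (assumes at least as many negatives as positives)."""
    rng = random.Random(seed)
    models = []
    for _ in range(members):
        subset = rng.sample(negatives, len(positives))
        labeled = [(x, 1) for x in positives] + [(x, 0) for x in subset]
        models.append(self_train(labeled, unlabeled))
    return models


def predict(models, x):
    """Majority vote over the ensemble members; returns 1 or 0."""
    votes = sum(score(m, x) > 0 for m in models)
    return int(2 * votes > len(models))


# Made-up, well-separated toy data: few positives, many negatives.
positives = [2.0, 2.5]
negatives = [-1.0 - 0.1 * i for i in range(20)]
unlabeled = [1.5, 1.7, 2.2, 3.0] + [-1.05 - 0.1 * i for i in range(16)]

models = train_ensemble(positives, negatives, unlabeled)
print(predict(models, 5.0), predict(models, -5.0))
```

Capping the number of pseudo-negatives at the number of confident pseudo-positives is one simple way to keep the majority class from swamping the labeled set; other balancing rules would fit the same skeleton.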