Minimalist ensemble algorithms for genome-wide protein localization prediction

BACKGROUND: Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to i...

Bibliographic Details
Main Authors: Lin, Jhih-Rong, Mondal, Ananda Mohan, Liu, Rong, Hu, Jianjun
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426488/
https://www.ncbi.nlm.nih.gov/pubmed/22759391
http://dx.doi.org/10.1186/1471-2105-13-157
_version_ 1782241515005804544
author Lin, Jhih-Rong
Mondal, Ananda Mohan
Liu, Rong
Hu, Jianjun
author_facet Lin, Jhih-Rong
Mondal, Ananda Mohan
Liu, Rong
Hu, Jianjun
author_sort Lin, Jhih-Rong
collection PubMed
description BACKGROUND: Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. RESULTS: This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. 
CONCLUSIONS: We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi.
format Online
Article
Text
id pubmed-3426488
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-34264882012-08-24 Minimalist ensemble algorithms for genome-wide protein localization prediction Lin, Jhih-Rong Mondal, Ananda Mohan Liu, Rong Hu, Jianjun BMC Bioinformatics Methodology Article BACKGROUND: Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. RESULTS: This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. 
Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. CONCLUSIONS: We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi. BioMed Central 2012-07-03 /pmc/articles/PMC3426488/ /pubmed/22759391 http://dx.doi.org/10.1186/1471-2105-13-157 Text en Copyright ©2012 Lin et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Lin, Jhih-Rong
Mondal, Ananda Mohan
Liu, Rong
Hu, Jianjun
Minimalist ensemble algorithms for genome-wide protein localization prediction
title Minimalist ensemble algorithms for genome-wide protein localization prediction
title_full Minimalist ensemble algorithms for genome-wide protein localization prediction
title_fullStr Minimalist ensemble algorithms for genome-wide protein localization prediction
title_full_unstemmed Minimalist ensemble algorithms for genome-wide protein localization prediction
title_short Minimalist ensemble algorithms for genome-wide protein localization prediction
title_sort minimalist ensemble algorithms for genome-wide protein localization prediction
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426488/
https://www.ncbi.nlm.nih.gov/pubmed/22759391
http://dx.doi.org/10.1186/1471-2105-13-157
work_keys_str_mv AT linjhihrong minimalistensemblealgorithmsforgenomewideproteinlocalizationprediction
AT mondalanandamohan minimalistensemblealgorithmsforgenomewideproteinlocalizationprediction
AT liurong minimalistensemblealgorithmsforgenomewideproteinlocalizationprediction
AT hujianjun minimalistensemblealgorithmsforgenomewideproteinlocalizationprediction