
Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins

BACKGROUND: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high performing predictors proposed recently use gene ontol...


Bibliographic Details
Main authors: Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765148/
https://www.ncbi.nlm.nih.gov/pubmed/26911432
http://dx.doi.org/10.1186/s12859-016-0940-x
author Wan, Shibiao
Mak, Man-Wai
Kung, Sun-Yuan
author_facet Wan, Shibiao
Mak, Man-Wai
Kung, Sun-Yuan
author_sort Wan, Shibiao
collection PubMed
description BACKGROUND: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. RESULTS: This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, not only where a protein locates can be decided, but also why it resides there can be revealed. CONCLUSIONS: Experimental results show that the output of both mEN and mLASSO are interpretable and they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers’ convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
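The one-vs-rest sparse regression idea summarized in the description can be illustrated with a minimal sketch. This is not the authors' mLASSO/mEN implementation: the function names, the toy data, the objective scaling, the omission of an intercept, and the 0.5 decision threshold are all assumptions made for illustration. Each location gets its own elastic-net regression over binary feature indicators (standing in for GO terms), fitted by coordinate descent, and the features with nonzero weights are the "selected" ones. Setting `l1_ratio=1.0` gives the LASSO case.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam, l1_ratio, n_iter=100):
    """Coordinate descent for the (assumed) objective
    (1/2n)||y - Xw||^2 + lam*(l1_ratio*||w||_1 + (1-l1_ratio)/2*||w||^2),
    without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature never present; its weight stays zero
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam * l1_ratio) / (col_sq[j] + lam * (1.0 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def fit_one_vs_rest(X, Y, lam, l1_ratio):
    """Fit one sparse regression per location (column of Y)."""
    return [fit_elastic_net(X, [row[k] for row in Y], lam, l1_ratio)
            for k in range(len(Y[0]))]


def selected_features(weights):
    """Indices of features with a nonzero weight for at least one location."""
    return sorted({j for w in weights for j, v in enumerate(w) if v != 0.0})


def predict_labels(weights, x, threshold=0.5):
    """Return every location whose score exceeds the threshold,
    falling back to the top-scoring location if none does."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in weights]
    chosen = [k for k, s in enumerate(scores) if s > threshold]
    return chosen or [max(range(len(scores)), key=scores.__getitem__)]


def make_toy_data(n=60, p=20, seed=0):
    """Hypothetical data: location k is present exactly when feature k is."""
    rng = random.Random(seed)
    X = [[float(rng.random() < 0.3) for _ in range(p)] for _ in range(n)]
    Y = [[row[0], row[1], row[2]] for row in X]
    return X, Y


X, Y = make_toy_data()
lasso_weights = fit_one_vs_rest(X, Y, lam=0.05, l1_ratio=1.0)
en_weights = fit_one_vs_rest(X, Y, lam=0.05, l1_ratio=0.5)
lasso_selected = selected_features(lasso_weights)
en_selected = selected_features(en_weights)
```

On this toy problem the nonzero weights point back to the features that determine each location, which is the sense in which a sparse solution is interpretable. A protein carrying features 0 and 2 can be scored with `predict_labels(lasso_weights, [1.0, 0.0, 1.0] + [0.0] * 17)`.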
format Online
Article
Text
id pubmed-4765148
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4765148 2016-02-25 Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins Wan, Shibiao Mak, Man-Wai Kung, Sun-Yuan BMC Bioinformatics Methodology Article BioMed Central 2016-02-24 /pmc/articles/PMC4765148/ /pubmed/26911432 http://dx.doi.org/10.1186/s12859-016-0940-x Text en © Wan et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765148/
https://www.ncbi.nlm.nih.gov/pubmed/26911432
http://dx.doi.org/10.1186/s12859-016-0940-x