Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique

BACKGROUND: In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which might be infeasible in certain cases, espe...

Bibliographic Details
Main authors: Bhardwaj, Nitin; Gerstein, Mark; Lu, Hui
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3009533/
https://www.ncbi.nlm.nih.gov/pubmed/20122235
http://dx.doi.org/10.1186/1471-2105-11-S1-S6
author Bhardwaj, Nitin
Gerstein, Mark
Lu, Hui
collection PubMed
description BACKGROUND: In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which might be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. The aforementioned limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set. METHODS: In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by iteratively identifying reliable negative (RN) examples from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and ~3750 unlabeled ones was used to construct and validate the protocol. RESULTS: Holdout evaluation of the protocol on a left-out positive set showed that the accuracy of prediction reached up to 95% during two independent implementations. CONCLUSION: These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here are particularly useful when information is available from one class only.
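The iterative reliable-negative step described in METHODS can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state which classifier or features the authors used, so a nearest-centroid classifier stands in for the learner, the two-dimensional points are toy data, and the names `fit` and `pu_learn` are hypothetical.

```python
# Illustrative PU-learning loop (not the authors' implementation).
# Assumption: a nearest-centroid classifier stands in for the real learner.

def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(pos, neg):
    """Return predict(x): True if x is closer to the positive centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, cp) < sq_dist(x, cn)

def pu_learn(positives, unlabeled, max_iter=50):
    """Shrink the unlabeled set to reliable negatives (RN), then train."""
    rn = list(unlabeled)  # start: treat every unlabeled example as negative
    for _ in range(max_iter):
        predict = fit(positives, rn)
        # Keep only the candidates the current model still calls negative.
        kept = [x for x in rn if not predict(x)]
        # Stop when the RN set no longer changes (or would become empty).
        if not kept or len(kept) == len(rn):
            break
        rn = kept
    # Final classifier is built from the positives and the final RN set.
    return fit(positives, rn), rn

# Toy data: known positives near (5, 5); the unlabeled set mixes two
# hidden positives with six true negatives near (0, 0).
positives = [[5.0, 5.0], [5.5, 4.5], [4.5, 5.5], [5.3, 5.3]]
hidden_positives = [[4.8, 5.1], [5.2, 4.9]]
negatives = [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5],
             [0.2, 0.1], [-0.3, -0.2], [0.1, -0.4]]
unlabeled = hidden_positives + negatives

classifier, reliable_negatives = pu_learn(positives, unlabeled)
```

On this toy data the first pass drops the two hidden positives from the candidate set, and the second pass changes nothing, so the loop stops with the six true negatives as the RN set. A real application would swap in a stronger classifier and sequence-derived features, neither of which is specified in this record.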
format Text
id pubmed-3009533
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3009533 2010-12-23 Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique Bhardwaj, Nitin Gerstein, Mark Lu, Hui BMC Bioinformatics Research BioMed Central 2010-01-18 /pmc/articles/PMC3009533/ /pubmed/20122235 http://dx.doi.org/10.1186/1471-2105-11-S1-S6 Text en Copyright ©2010 Bhardwaj et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3009533/
https://www.ncbi.nlm.nih.gov/pubmed/20122235
http://dx.doi.org/10.1186/1471-2105-11-S1-S6