
ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

BACKGROUND: Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numb...


Bibliographic Details
Main authors: Huang, Wen-Lin, Tung, Chun-Wei, Ho, Shih-Wen, Hwang, Shiow-Fen, Ho, Shinn-Ying
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2262056/
https://www.ncbi.nlm.nih.gov/pubmed/18241343
http://dx.doi.org/10.1186/1471-2105-9-80
author Huang, Wen-Lin
Tung, Chun-Wei
Ho, Shih-Wen
Hwang, Shiow-Fen
Ho, Shinn-Ying
collection PubMed
description BACKGROUND: Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate method for predicting the subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing. RESULTS: This study proposes an efficient sequence-based method (named ProLoc-GO) that mines informative GO terms for predicting protein subcellular localization. For each protein, BLAST is used to obtain a homolog with a known accession number, which is then used to retrieve the GO annotation. A large number n of all annotated GO terms that have ever appeared is then obtained from a large set of training proteins. A novel genetic-algorithm-based method (named GOmining) combined with a support vector machine (SVM) classifier is proposed to simultaneously identify a small number m out of the n GO terms as input features to the SVM, where m << n. The m informative GO terms contain the essential GO terms annotating subcellular compartments, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). Two existing data sets, SCL12 (human proteins with 12 locations) and SCL16 (eukaryotic proteins with 16 locations), with <25% sequence identity are used to evaluate ProLoc-GO, which has been implemented using a single SVM classifier with m = 44 and m = 60 informative GO terms, respectively. ProLoc-GO using input sequences yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, which are significantly better than those of SVM-based methods using amino acid composition (AAC) with acid pairs and AAC with dipeptide composition, which achieve test accuracies below 35%.
For comparison, ProLoc-GO using known accession numbers of query proteins yields test accuracies of 90.6% and 85.7% for SCL12 and SCL16, respectively, which are also better than those of Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%), which use ensemble classifiers with hybridization of GO terms and amphiphilic pseudo amino acid composition. CONCLUSION: The growth of Gene Ontology in size and popularity has increased the effectiveness of GO-based features. GOmining can serve as a tool for selecting informative GO terms in solving sequence-based prediction problems. The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented (see Availability).
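The feature-selection idea in the abstract (encode each protein as a binary vector over GO terms, then search for a small informative subset) can be illustrated with a minimal sketch. This is not the authors' GOmining algorithm: the toy data set, the function names, the penalty weight and the simple elitist genetic algorithm are all assumptions made for illustration, and a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library. GO:0003677 and GO:0005515 are added here as extra, less informative terms; the three compartment terms are the ones named in the abstract.

```python
import random

# Hypothetical toy data: (GO terms of the BLAST homolog, known location).
PROTEINS = [
    ({"GO:0005634", "GO:0003677", "GO:0005515"}, "nucleus"),
    ({"GO:0005634", "GO:0005515"}, "nucleus"),
    ({"GO:0005737", "GO:0005515"}, "cytoplasm"),
    ({"GO:0005737", "GO:0003677"}, "cytoplasm"),
    ({"GO:0005856", "GO:0005515"}, "cytoskeleton"),
    ({"GO:0005856", "GO:0005737", "GO:0005515"}, "cytoskeleton"),
]

# All n GO terms that appear in the training set, in a fixed order.
GO_TERMS = sorted(set().union(*(terms for terms, _ in PROTEINS)))


def encode(terms, mask):
    """Binary feature vector restricted to the GO terms switched on in mask."""
    return [1.0 if t in terms else 0.0 for t, keep in zip(GO_TERMS, mask) if keep]


def accuracy(mask):
    """Training accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    if not any(mask):
        return 0.0
    vectors = [(encode(terms, mask), loc) for terms, loc in PROTEINS]
    centroids = {}
    for loc in sorted({loc for _, loc in PROTEINS}):
        members = [v for v, l in vectors if l == loc]
        centroids[loc] = [sum(col) / len(members) for col in zip(*members)]

    def predict(v):
        return min(centroids,
                   key=lambda loc: sum((a - b) ** 2 for a, b in zip(v, centroids[loc])))

    return sum(predict(v) == loc for v, loc in vectors) / len(vectors)


def fitness(mask):
    """Reward accuracy, lightly penalise every selected term (assumed weight 0.01)."""
    return accuracy(mask) - 0.01 * sum(mask)


def select_terms(generations=30, pop_size=12, seed=0):
    """Toy elitist genetic algorithm over bit masks of length n."""
    rng = random.Random(seed)
    n = len(GO_TERMS)
    # Seed the population with the "use every term" mask plus random masks.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]   # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)               # flip one bit as mutation
            child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = select_terms()
selected = [t for t, keep in zip(GO_TERMS, best) if keep]
print("selected GO terms:", selected, "fitness:", round(fitness(best), 3))
```

Because the best half of each generation is carried over unchanged and the all-terms mask is in the initial population, the returned subset never scores worse than using every GO term under this toy fitness. A real pipeline would replace the hard-coded term sets with GO annotations retrieved for BLAST hits and the centroid classifier with an SVM.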
format Text
id pubmed-2262056
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2262056 2008-03-04 ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization Huang, Wen-Lin Tung, Chun-Wei Ho, Shih-Wen Hwang, Shiow-Fen Ho, Shinn-Ying BMC Bioinformatics Methodology Article BioMed Central 2008-02-01 /pmc/articles/PMC2262056/ /pubmed/18241343 http://dx.doi.org/10.1186/1471-2105-9-80 Text en Copyright © 2008 Huang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2262056/
https://www.ncbi.nlm.nih.gov/pubmed/18241343
http://dx.doi.org/10.1186/1471-2105-9-80