
Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition

BACKGROUND: Knowing where a protein occurs in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular location automatically from its sequence. The most studied methods for predicting the subcellular localization of proteins are signal peptides,...


Bibliographic Details
Main authors: Habib, Tanwir, Zhang, Chaoyang, Yang, Jack Y, Yang, Mary Qu, Deng, Youping
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2386058/
https://www.ncbi.nlm.nih.gov/pubmed/18366605
http://dx.doi.org/10.1186/1471-2164-9-S1-S16
author Habib, Tanwir
Zhang, Chaoyang
Yang, Jack Y
Yang, Mary Qu
Deng, Youping
collection PubMed
description BACKGROUND: Knowing where a protein occurs in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular location automatically from its sequence. The most studied methods for predicting subcellular localization rely on signal peptides, on sequence homology, and on the correlation between the total amino acid compositions of proteins. Taking both amino acid composition and amino acid pair composition into consideration helps improve prediction accuracy. RESULTS: We constructed a dataset of protein sequences from the SWISS-PROT database and divided them into 12 classes based on their subcellular locations. SVM modules were trained to predict subcellular location from amino acid composition and amino acid pair composition. Results were calculated using 10-fold cross-validation. The Radial Basis Function (RBF) kernel outperformed the polynomial and linear kernels. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets with nine and five subcellular locations. Total accuracy further improved to 79.9% and 85.66%. CONCLUSIONS: A new SVM-based approach using amino acid and amino acid pair composition is presented. The results show that data simulation and taking more protein features into consideration improve accuracy considerably. We also observed that the dataset needs to be constructed to take account of the distribution of data across all classes.
format Text
id pubmed-2386058
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2386058 2008-05-15 Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition Habib, Tanwir Zhang, Chaoyang Yang, Jack Y Yang, Mary Qu Deng, Youping BMC Genomics Research
BioMed Central 2008-03-20 /pmc/articles/PMC2386058/ /pubmed/18366605 http://dx.doi.org/10.1186/1471-2164-9-S1-S16 Text en Copyright © 2008 Habib et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition
topic Research
work_keys_str_mv AT habibtanwir supervisedlearningmethodforthepredictionofsubcellularlocalizationofproteinsusingaminoacidandaminoacidpaircomposition
AT zhangchaoyang supervisedlearningmethodforthepredictionofsubcellularlocalizationofproteinsusingaminoacidandaminoacidpaircomposition
AT yangjacky supervisedlearningmethodforthepredictionofsubcellularlocalizationofproteinsusingaminoacidandaminoacidpaircomposition
AT yangmaryqu supervisedlearningmethodforthepredictionofsubcellularlocalizationofproteinsusingaminoacidandaminoacidpaircomposition
AT dengyouping supervisedlearningmethodforthepredictionofsubcellularlocalizationofproteinsusingaminoacidandaminoacidpaircomposition
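The abstract in this record names two sequence-derived feature types, amino acid composition and amino acid pair composition, without giving their exact definitions. The following Python sketch shows one common way such features might be computed: a 20-element vector of residue frequencies and a 400-element vector of adjacent-pair frequencies. The function names, the residue ordering, and the handling of non-standard residues are assumptions made for illustration and are not taken from the article.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
AMINO_ACID_PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return 20 fractions, one per standard residue (hypothetical helper).

    Non-standard characters are ignored; an empty input yields all zeros.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def amino_acid_pair_composition(sequence):
    """Return 400 fractions, one per ordered adjacent pair (hypothetical helper).

    Pairs that contain a non-standard character are skipped.
    """
    seq = sequence.upper()
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS
    ]
    if not pairs:
        return [0.0] * len(AMINO_ACID_PAIRS)
    counts = Counter(pairs)
    return [counts[p] / len(pairs) for p in AMINO_ACID_PAIRS]


if __name__ == "__main__":
    example = "ACAC"  # toy sequence, not real data
    composition = amino_acid_composition(example)
    pair_composition = amino_acid_pair_composition(example)
    print(len(composition), len(pair_composition))
```

Vectors of this kind could then be passed to an SVM classifier such as the one the abstract describes; the kernel choice and the cross-validation setup are outside the scope of this sketch.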