Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition
BACKGROUND: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and multiple feature types such as physicochemical properties, amino acid and dip...
Main authors: | Huang, Hui-Ling; Charoenkwan, Phasit; Kao, Te-Fen; Lee, Hua-Chin; Chang, Fang-Lin; Huang, Wen-Lin; Ho, Shinn-Jang; Shu, Li-Sun; Chen, Wen-Liang; Ho, Shinn-Ying |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2012 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521471/ https://www.ncbi.nlm.nih.gov/pubmed/23282103 http://dx.doi.org/10.1186/1471-2105-13-S17-S3 |
_version_ | 1782252961677705216 |
---|---|
author | Huang, Hui-Ling Charoenkwan, Phasit Kao, Te-Fen Lee, Hua-Chin Chang, Fang-Lin Huang, Wen-Lin Ho, Shinn-Jang Shu, Li-Sun Chen, Wen-Liang Ho, Shinn-Ying |
author_facet | Huang, Hui-Ling Charoenkwan, Phasit Kao, Te-Fen Lee, Hua-Chin Chang, Fang-Lin Huang, Wen-Lin Ho, Shinn-Jang Shu, Li-Sun Chen, Wen-Liang Ho, Shinn-Ying |
author_sort | Huang, Hui-Ling |
collection | PubMed |
description | BACKGROUND: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and multiple feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied by feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. RESULTS: This study proposes a novel scoring card method (SCM) that uses only dipeptide composition to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between soluble and insoluble proteins in a training data set. Subsequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and the dipeptide composition. To evaluate SCM by performance comparisons, four data sets differing in size and in the degree of variation in experimental conditions were used. The results show that the simple SCM, with interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use multiple feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows that α-helical structures and thermophilic proteins have a high propensity to be soluble. CONCLUSIONS: The propensities of individual dipeptides to be soluble vary for proteins under altered experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. AVAILABILITY: The datasets, source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/. |
format | Online Article Text |
id | pubmed-3521471 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-35214712012-12-14 Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition Huang, Hui-Ling Charoenkwan, Phasit Kao, Te-Fen Lee, Hua-Chin Chang, Fang-Lin Huang, Wen-Lin Ho, Shinn-Jang Shu, Li-Sun Chen, Wen-Liang Ho, Shinn-Ying BMC Bioinformatics Proceedings BACKGROUND: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and multiple feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied by feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. RESULTS: This study proposes a novel scoring card method (SCM) that uses only dipeptide composition to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between soluble and insoluble proteins in a training data set. Subsequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and the dipeptide composition. To evaluate SCM by performance comparisons, four data sets differing in size and in the degree of variation in experimental conditions were used. The results show that the simple SCM, with interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use multiple feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows that α-helical structures and thermophilic proteins have a high propensity to be soluble. CONCLUSIONS: The propensities of individual dipeptides to be soluble vary for proteins under altered experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. AVAILABILITY: The datasets, source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/. BioMed Central 2012-12-07 /pmc/articles/PMC3521471/ /pubmed/23282103 http://dx.doi.org/10.1186/1471-2105-13-S17-S3 Text en Copyright ©2012 Huang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Huang, Hui-Ling Charoenkwan, Phasit Kao, Te-Fen Lee, Hua-Chin Chang, Fang-Lin Huang, Wen-Lin Ho, Shinn-Jang Shu, Li-Sun Chen, Wen-Liang Ho, Shinn-Ying Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title | Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title_full | Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title_fullStr | Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title_full_unstemmed | Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title_short | Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
title_sort | prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521471/ https://www.ncbi.nlm.nih.gov/pubmed/23282103 http://dx.doi.org/10.1186/1471-2105-13-S17-S3 |
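
The scoring step summarized in the abstract (dipeptide propensities combined with a sequence's dipeptide composition) can be illustrated with a minimal Python sketch. This is not the authors' released code. It assumes that the composition values act as the weights in the weighted sum, and that the initial score card is simply the difference between the mean dipeptide composition of soluble and insoluble training sequences, which is one plausible reading of "statistical discrimination". The genetic-algorithm optimization is omitted, and all function names and the toy sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the relative frequency of each of the 400 dipeptides."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def mean_composition(sequences):
    """Average dipeptide composition over a non-empty list of sequences."""
    comps = [dipeptide_composition(s) for s in sequences]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_propensities(soluble, insoluble):
    """Assumed initial score card: soluble mean minus insoluble mean."""
    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    return {d: pos[d] - neg[d] for d in DIPEPTIDES}


def solubility_score(sequence, propensities):
    """Sum of propensity scores weighted by the sequence's composition."""
    comp = dipeptide_composition(sequence)
    return sum(propensities[d] * comp[d] for d in DIPEPTIDES)


# Toy training data, invented purely for illustration.
toy_soluble = ["AKAKAK", "KAKAKA"]
toy_insoluble = ["WWWWWW", "FWFWFW"]
score_card = initial_propensities(toy_soluble, toy_insoluble)
print(solubility_score("AKAKAK", score_card) > solubility_score("WWWWWW", score_card))
```

In this sketch a sequence would be called soluble when its score exceeds a threshold chosen on the training data. Because the score card is a plain table of 400 numbers, each dipeptide's contribution to a prediction can be read off directly, which is the interpretability the abstract emphasizes.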