Predicting mostly disordered proteins by using structure-unknown protein data
BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The...
Main authors: | Shimizu, Kana; Muraoka, Yoichi; Hirose, Shuichi; Tomii, Kentaro; Noguchi, Tamotsu |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838436/ https://www.ncbi.nlm.nih.gov/pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78 |
_version_ | 1782132830972674048 |
---|---|
author | Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu |
author_facet | Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu |
author_sort | Shimizu, Kana |
collection | PubMed |
description | BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We know many more protein sequences than we do protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus it would be efficient to use the information of structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using a spectral graph transducer (SGT) and training with a huge amount of structure-unknown sequences as well as structure-known sequences. RESULTS: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts, and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs.
When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052–0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036–0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5%–10% disordered sequences, 1.46% for the proteins with 10%–20% disordered sequences and 16.57% for proteins with 20%–40% disordered sequences. CONCLUSION: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness. |
format | Text |
id | pubmed-1838436 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1838436 2007-04-04 Predicting mostly disordered proteins by using structure-unknown protein data Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu BMC Bioinformatics Methodology Article BioMed Central 2007-03-06 /pmc/articles/PMC1838436/ /pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78 Text en Copyright © 2007 Shimizu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu Predicting mostly disordered proteins by using structure-unknown protein data |
title | Predicting mostly disordered proteins by using structure-unknown protein data |
title_full | Predicting mostly disordered proteins by using structure-unknown protein data |
title_fullStr | Predicting mostly disordered proteins by using structure-unknown protein data |
title_full_unstemmed | Predicting mostly disordered proteins by using structure-unknown protein data |
title_short | Predicting mostly disordered proteins by using structure-unknown protein data |
title_sort | predicting mostly disordered proteins by using structure-unknown protein data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838436/ https://www.ncbi.nlm.nih.gov/pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78 |
work_keys_str_mv | AT shimizukana predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT muraokayoichi predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT hiroseshuichi predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT tomiikentaro predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT noguchitamotsu predictingmostlydisorderedproteinsbyusingstructureunknownproteindata |
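
The description field reports sensitivity, specificity, and differences in Matthews correlation coefficient (MCC), but the record does not give the absolute MCC. The sketch below shows how the three measures relate, using the standard MCC formula. It assumes, purely for illustration, that the reported rates (0.723 and 0.977) apply directly to the 82 disordered and 526 ordered proteins, which yields fractional expected counts; the function names `expected_confusion` and `mcc` are hypothetical, and the printed value is an estimate, not a figure taken from the paper.

```python
import math


def expected_confusion(n_pos, n_neg, sensitivity, specificity):
    """Expected confusion-matrix counts implied by the reported rates.

    Assumption (not stated in the record): the rates apply to the whole
    test set, so the counts may be fractional.
    """
    tp = sensitivity * n_pos   # disordered proteins predicted as disordered
    fn = n_pos - tp            # disordered proteins predicted as ordered
    tn = specificity * n_neg   # ordered proteins predicted as ordered
    fp = n_neg - tn            # ordered proteins predicted as disordered
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Class sizes and rates as quoted in the description field.
    counts = expected_confusion(n_pos=82, n_neg=526,
                                sensitivity=0.723, specificity=0.977)
    print(f"Estimated MCC: {mcc(*counts):.3f}")
```

Because the disordered class is much smaller than the ordered class (82 versus 526), MCC is a more informative single summary here than plain accuracy, which would be dominated by the ordered proteins.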